Information processing system, arithmetic processing circuit, and control method for information processing system

ABSTRACT

An arithmetic processing circuit includes a dividing circuit that divides a plurality of data blocks into groups of a number equal to the number of arithmetic processing circuits included in an information processing apparatus, a data selecting circuit that selects respective first data blocks from the plurality of data blocks included in the respective groups, a transmission destination selecting circuit that selects arithmetic processing circuits different from each other as respective transmission destinations from the plurality of arithmetic processing circuits for the respective first data blocks selected by the data selecting circuit based on destination number information obtained by an exclusive disjunction operation on identification number information assigned to each arithmetic processing circuit and cyclic number information assigned to each group, and a transmitting circuit that transmits the respective first data blocks selected by the data selecting circuit to the respective arithmetic processing circuits selected by the transmission destination selecting circuit.

CROSS-REFERENCE TO RELATED APPLICATION

This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2017-214029, filed on Nov. 6, 2017, the entire contents of which are incorporated herein by reference.

FIELD

The embodiments discussed herein are related to an information processing system, an arithmetic processing circuit, and a control method for an information processing system.

BACKGROUND

A parallel computation system formed by coupling a large number of computers referred to as nodes is often used in a field of high performance computing (HPC). A node may be, for example, one chip set or the like. In recent years, parallel computer systems have been used also for deep learning or the like.

There are a mesh connection and a torus connection as forms of coupling of nodes in parallel computation systems. The mesh connection is a form of coupling in which nodes are arranged in the form of a mesh in a plurality of axial directions, and nodes adjacent to each other in each of the axial directions are coupled to each other by a high-speed network referred to as an interconnect. The torus connection is a form of coupling in which the mesh connection is made, and then nodes at both ends on each of the axes are coupled to each other. There are also networks where all of the axes have the mesh connection or the torus connection and forms of coupling such that a part of the axes have the mesh connection and the other axes have the torus connection. For example, parallel computation systems include devices having a topology as a six-dimensional torus structure.

Further, a parallel computation system may adopt a configuration that includes a plurality of system boards each having a plurality of nodes mounted thereon. A coupling between nodes arranged on a same system board is established by a high-speed dedicated interconnect. On the other hand, a coupling between nodes arranged on different system boards is established via a network switch using peripheral component interconnect (PCI) and InfiniBand (registered trademark). Here, the coupling between the nodes within the same system board will be referred to as an “inside coupling,” and the coupling between the nodes via the network switch between the different system boards will be referred to as an “outside coupling.” The inside coupling, which is established by the dedicated interconnect, has a wide bandwidth as compared with the outside coupling using PCI and InfiniBand, and thus enables communication at high speed.

Then, each of the nodes of the parallel computer system processes a program used in solving a complex problem at high speed. For example, the parallel computer system divides a job as an executable unit of the program into a plurality of processes, and allocates the divided processes to the respective nodes. Here, the processes are a program in which each node actually performs arithmetic processing. When each node obtains a process, the node performs arithmetic processing of the obtained process. When each node completes the arithmetic processing of the process, the node transmits an arithmetic result to a management server, and ends the arithmetic processing. In addition, the parallel computer system transmits a new process to the node ending the arithmetic processing, and makes the node perform arithmetic processing. Then, the parallel computer system integrates the results of the arithmetic processing performed by the respective nodes on the management server, and obtains an arithmetic result of the whole of the job.

The parallel computation system may perform the processing of Allreduce in such arithmetic processing. Allreduce is processing of integrating values calculated by respective processes, and sharing, in all of the processes, a result obtained by performing an operation using the integrated values. In this case, each node performs group communication. When the group communication is performed, the process performed by each node retains the arithmetic result of the values possessed by all of the processes. Thus, when the processing of Allreduce is performed, each node obtains the values possessed by all of the other nodes. However, a network load is increased when the value possessed by each of the nodes is transmitted to all of the other nodes, for example, in the processing of Allreduce.
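As an illustration only, the result that Allreduce produces can be sketched as follows. The sketch assumes element-wise addition as the operation and models each process as a plain Python list; the function name `allreduce` and the sample values are hypothetical and do not describe how the communication itself is carried out.

```python
from functools import reduce
from operator import add


def allreduce(values_per_process, op=add):
    """Return the data each process holds after Allreduce.

    values_per_process[p] is the list of values calculated by process p.
    Every process ends up with the same element-wise combined list.
    """
    # Combine the k-th value of every process into one result value.
    combined = [reduce(op, column) for column in zip(*values_per_process)]
    # Share the combined result with all of the processes.
    return [list(combined) for _ in values_per_process]


local_values = [[1, 2], [10, 20], [100, 200]]
result = allreduce(local_values)
print(result)
```

Each of the three processes holds the same combined list after the call, which is the property that the group communication is meant to provide.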

It is therefore desirable to reduce communication data amounts between the nodes when the processing of Allreduce is performed. Accordingly, a Halving+Doubling method is proposed as a technology of reducing the communication data amounts in Allreduce. Halving+Doubling may be referred to also as Reduce_scatter+Allgather.

When a Halving operation in the Halving+Doubling method is performed, a communication data amount is halved in each communication step. When a Doubling operation is performed, on the other hand, the communication data amount is doubled in each communication step. For example, in the Halving+Doubling method, performing Halving after a start of processing reduces the communication data amount as the step advances, and subsequently performing Doubling increases the communication data amount as the step advances. Therefore, in the Halving+Doubling method, mutual communication of a large amount of data is performed in the early steps, mutual communication of a small amount of data is performed in the middle steps, and thereafter mutual communication of a large amount of data is performed again in the later steps.
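The change in the communication data amount can be sketched as follows. This is a minimal sketch that assumes the commonly used formulation in which the number of nodes is a power of two and each node sends half of its current working data in every Halving step; the function name `halving_doubling_sizes` and the example sizes are hypothetical.

```python
def halving_doubling_sizes(total_size, num_nodes):
    """Return per-step send sizes for Halving and for Doubling.

    Assumes num_nodes is a power of two, so that log2(num_nodes) steps
    are performed in each phase.
    """
    steps = num_nodes.bit_length() - 1
    if num_nodes < 2 or (1 << steps) != num_nodes:
        raise ValueError("num_nodes must be a power of two and at least 2")
    # Halving: the amount sent is halved in each successive step.
    halving = [total_size // (2 ** (step + 1)) for step in range(steps)]
    # Doubling: the same amounts in reverse order, doubling in each step.
    doubling = halving[::-1]
    return halving, doubling


halving, doubling = halving_doubling_sizes(total_size=1024, num_nodes=8)
print(halving, doubling)
```

With these example values the amounts shrink during Halving and grow back during Doubling, so the smallest transfers fall in the middle of the sequence.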

Here, as described above, in the parallel computation system having the inside coupling and the outside coupling, the outside coupling has a narrow bandwidth, and therefore the data amount of data transmitted and received in the outside coupling is desirably small. Accordingly, when the processing of Allreduce is performed in the parallel computation system having the inside coupling and the outside coupling, it is desirable to perform communication in the outside coupling after reducing the size of the data as much as possible in the inside coupling. For example, the processing of Allreduce may be performed by the following method. First, the amount of data transmitted and received is reduced by performing Halving in the inside coupling, and thereafter Allreduce processing of data in the outside coupling is performed. Thereafter, the amount of data handled is increased by performing Doubling in the inside coupling, and the processing of Allreduce is completed.
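The three phases described above can be sketched as follows. The sketch models only the data each node holds after each phase, not the individual communication steps, and it assumes summation as the operation and a vector length divisible by the number of nodes per board; the function name `hierarchical_allreduce` and the data layout are hypothetical.

```python
def hierarchical_allreduce(data):
    """data[b][i] is the vector held by node i on system board b."""
    num_boards = len(data)
    nodes = len(data[0])
    length = len(data[0][0])
    if length % nodes:
        raise ValueError("vector length must be divisible by nodes per board")
    chunk = length // nodes

    # Phase 1 (Halving, inside coupling): node i on each board ends up
    # owning the board-local sum of chunk i only.
    partial = [
        [
            [sum(data[b][j][k] for j in range(nodes))
             for k in range(i * chunk, (i + 1) * chunk)]
            for i in range(nodes)
        ]
        for b in range(num_boards)
    ]

    # Phase 2 (Allreduce, outside coupling): only chunk-sized data is
    # exchanged between nodes with the same position on different boards.
    reduced = [
        [sum(partial[b][i][k] for b in range(num_boards))
         for k in range(chunk)]
        for i in range(nodes)
    ]

    # Phase 3 (Doubling, inside coupling): every node gathers all chunks.
    full = [value for i in range(nodes) for value in reduced[i]]
    return [[list(full) for _ in range(nodes)] for _ in range(num_boards)]


# Two boards with four nodes each; every node holds a vector of length 4.
data = [[[(b * 4 + i) * 10 + k for k in range(4)] for i in range(4)]
        for b in range(2)]
result = hierarchical_allreduce(data)
print(result[0][0])
```

In this sketch each node passes only `chunk` elements over the outside coupling instead of the whole vector, which is the reduction the paragraph above aims for.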

When such Allreduce processing is performed, the form of coupling of the nodes desirably has a connection forming a hypercube. An n-dimensional hypercube has the following features. The n-dimensional hypercube is constituted of 2^(n) nodes. Then, each node has n links. Further, when a binary index is assigned to each node, each node is adjacent to and coupled to nodes different from the node in the value of one bit in the bit strings of the assigned indexes. For example, in the case where the nodes have a form of coupling constituting a hypercube, it is easy to identify data transmission destinations, and a processing load is reduced because data transmission and reception in the case of performing the processing of Allreduce becomes easy.

Further, as a technology of group communication in a parallel computer system, there is a technology that calculates an entire processing time, switches between an entire data communication and a partial data communication so as to select a shorter processing time, and performs the communication.

A related technology is disclosed in Japanese Laid-open Patent Publication No. 2001-325239.

SUMMARY

According to an aspect of the embodiments, an arithmetic processing circuit includes a dividing circuit that divides a plurality of data blocks into groups of a number equal to the number of arithmetic processing circuits included in an information processing apparatus, a data selecting circuit that selects respective first data blocks from the plurality of data blocks included in the respective groups, a transmission destination selecting circuit that selects arithmetic processing circuits different from each other as respective transmission destinations from the plurality of arithmetic processing circuits for the respective first data blocks selected by the data selecting circuit based on destination number information obtained by an exclusive disjunction operation on identification number information assigned to each arithmetic processing circuit and cyclic number information assigned to each group, and a transmitting circuit that transmits the respective first data blocks selected by the data selecting circuit to the respective arithmetic processing circuits selected by the transmission destination selecting circuit.

The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.

BRIEF DESCRIPTION OF DRAWINGS

FIGS. 1A and 1B illustrate diagrams of a hardware configuration of an information processing system according to a first embodiment;

FIG. 2 is a diagram illustrating various kinds of hypercubes;

FIGS. 3A and 3B illustrate diagrams of a hardware configuration of an arithmetic processing circuit;

FIGS. 4A and 4B illustrate block diagrams of a network control circuit;

FIG. 5 is a diagram illustrating an outline of a flow of processing in Halving;

FIG. 6 is a diagram illustrating an outline of a flow of processing up to Allreduces with an external topology;

FIG. 7 is a diagram illustrating an outline of a flow of processing in Doubling;

FIG. 8 is a diagram of assistance in explaining a method of determining data blocks for transmission and reception on a thread Th1 side;

FIG. 9 is a diagram illustrating a whole of Halving for a thread Th1 by four arithmetic processing circuits included in a two-dimensional hypercube;

FIG. 10 is a diagram of assistance in explaining a method of determining data blocks for transmission and reception on a thread Th2 side in Halving;

FIG. 11 is a diagram illustrating Halving for a thread Th2 by four arithmetic processing circuits included in a two-dimensional hypercube;

FIG. 12 is a diagram of assistance in explaining destination determination processing according to the first embodiment;

FIG. 13 is a diagram of assistance in explaining destination transitions in a two-dimensional hypercube;

FIGS. 14A to 14C illustrate a flowchart of a whole of processing of Allreduce;

FIG. 15 is a flowchart of Halving;

FIG. 16 is a flowchart of Doubling;

FIG. 17 is a diagram illustrating an example of pseudocode of a program for performing Halving and Doubling;

FIGS. 18A and 18B illustrate a flowchart of processing performed in a program implementing Halving;

FIGS. 19A and 19B illustrate a flowchart of processing performed in a program implementing Doubling;

FIG. 20 is a diagram of assistance in explaining effects of processing of Allreduce according to the first embodiment;

FIGS. 21A and 21B illustrate diagrams of a hardware configuration of an information processing system according to a second embodiment;

FIG. 22 is a diagram of assistance in explaining destination determination processing according to the second embodiment;

FIG. 23 is a diagram of assistance in explaining destination transitions in a three-dimensional hypercube;

FIG. 24 is a diagram illustrating arithmetic processing circuits performing communication in each communication in an information processing system according to the second embodiment;

FIGS. 25A and 25B illustrate diagrams of a hardware configuration of an information processing system according to a third embodiment; and

FIG. 26 is a diagram illustrating arithmetic processing circuits performing communication in each communication in an information processing system according to the third embodiment.

DESCRIPTION OF EMBODIMENTS

However, when Halving+Doubling is simply used for a topology having a hypercube, there may be a path not used in communication in each step. Therefore, with a method of simply using Halving+Doubling, a band is not fully used, and thus it is difficult to improve the speed of Allreduce.

In addition, even when the technology is used which switches between entire data communication and partial data communication, and performs the communication, a path not used in communication in each step occurs, so that it is difficult to improve the speed of Allreduce.

Embodiments of an information processing system, an arithmetic processing circuit, and a control method of the information processing system disclosed in the present application will hereinafter be described in detail with reference to the drawings. It is to be noted that the following embodiments do not limit the information processing system, the arithmetic processing circuit, and the control method of the information processing system disclosed in the present application.

First Embodiment

FIGS. 1A and 1B illustrate diagrams of a hardware configuration of an information processing system according to a first embodiment. The information processing system according to the present embodiment includes two system boards 1. Each of the system boards 1 includes a central processing unit (CPU) 2, two PCI switches 3, and four host channel adaptors (HCAs) 4.

The CPUs 2 on the respective system boards 1 are directly coupled to each other. In addition, each of the CPUs 2 on the system boards 1 is coupled to two PCI switches 3. The CPU 2 receives an instruction of a job input from an operator, and transmits the job to each arithmetic processing circuit 20 to be described later.

Each PCI switch 3 is coupled to the CPU 2 and two arithmetic processing circuits 20. In addition, each PCI switch 3 is coupled to two HCAs 4. The PCI switch 3 performs path selection at a time of communication using PCI by the CPU 2 and the arithmetic processing circuits 20.

The HCAs 4 are communication interfaces in communication using InfiniBand by the CPU 2 and the arithmetic processing circuits 20. The HCAs 4 are coupled to the PCI switches 3. In addition, the HCAs 4 are coupled to a network switch 5.

The network switch 5 is coupled with the HCAs 4 on each of the system boards 1. The network switch 5 performs path selection in communication using InfiniBand by the CPU 2 and the arithmetic processing circuits 20.

A sub-net manager 6 manages communication paths of communication performed via the network switch 5. As an example, the sub-net manager 6 operates on one server on a network, and periodically performs a path update.

Each of the arithmetic processing circuits 20 includes an arithmetic circuit for performing parallel computation. The arithmetic processing circuit 20 is, for example, a graphic processing unit (GPU). The arithmetic processing circuit 20 performs arithmetic processing in deep learning or the like. The arithmetic processing circuit 20 may hereinafter be referred to as a “node.” The arithmetic processing circuit 20 corresponds to an example of an “arithmetic processing circuit.”

Four arithmetic processing circuits 20 are arranged on one system board 1. Then, an arithmetic processing circuit 20 is coupled to two adjacent arithmetic processing circuits 20 by an interconnect. A path represented by a thick solid line in FIGS. 1A and 1B represents the interconnect.

The four arithmetic processing circuits 20 each coupled to two adjacent arithmetic processing circuits 20 are coupled so as to form a two-dimensional hypercube. The hypercube will be described in the following. FIG. 2 is a diagram illustrating various kinds of hypercubes. A configuration HQ1 is a one-dimensional hypercube. A configuration HQ2 is a two-dimensional hypercube. A configuration HQ3 is a three-dimensional hypercube. A configuration HQ4 is a four-dimensional hypercube.

In a case where a binary index having n bits is assigned to each node, an n-dimensional hypercube is formed by coupling nodes different from each other in the value of one bit in bit strings representing the indexes to each other. For example, a number added to each node in the configurations HQ1 to HQ3 in FIG. 2 is a binary index assigned to each node such that each node is coupled to nodes different from the node in the value of one bit in the bit strings. For example, as illustrated in the configurations HQ1 to HQ3 in FIG. 2, in the n-dimensional hypercube, intercommunication is performed between nodes different from each other in the value of one bit in the bit strings representing the assigned indexes.

As an example, arithmetic processing circuits 20 according to the present embodiment are coupled so as to form a two-dimensional hypercube. Accordingly, when indexes of 00, 01, 10, and 11 are assigned to four arithmetic processing circuits 20, communication is performed as follows. The arithmetic processing circuit 20 having the index of 00 communicates with the nodes having the indexes of 10 and 01. In addition, the arithmetic processing circuit 20 having the index of 01 communicates with the arithmetic processing circuits 20 having the indexes of 00 and 11. The arithmetic processing circuit 20 having the index of 10 communicates with the nodes having the indexes of 00 and 11. In addition, the arithmetic processing circuit 20 having the index of 11 communicates with the arithmetic processing circuits 20 having the indexes of 10 and 01.
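The adjacency rule, under which coupled nodes differ in the value of exactly one bit, can be sketched as follows. The function name `hypercube_neighbors` is hypothetical; the sketch simply inverts each of the n bits of an index in turn.

```python
def hypercube_neighbors(index, n):
    """Return the indexes adjacent to `index` in an n-dimensional hypercube."""
    # Inverting one bit at a time yields the n directly coupled nodes.
    return [index ^ (1 << bit) for bit in range(n)]


# Two-dimensional hypercube with the indexes 00, 01, 10, and 11.
for node in range(4):
    neighbors = [format(x, "02b") for x in hypercube_neighbors(node, 2)]
    print(format(node, "02b"), "->", neighbors)
```

For the two-dimensional case this reproduces the pairs listed above: 00 is adjacent to 01 and 10, and 11 is adjacent to 10 and 01.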

In addition, while in the present embodiment, description is made of a case where arithmetic processing circuits 20 are coupled so as to form a two-dimensional hypercube, hypercubes of three dimensions or more may also have functions similar to functions to be described in the following. Arithmetic processing circuits 20 directly coupled to a particular arithmetic processing circuit 20 without the intervention of another arithmetic processing circuit 20 among arithmetic processing circuits 20 coupled so as to form a hypercube correspond to an example of arithmetic processing circuits 20 “adjacent” to the particular arithmetic processing circuit 20. The number of the arithmetic processing circuits 20 directly coupled to the particular arithmetic processing circuit 20 without the intervention of another arithmetic processing circuit 20 corresponds to the dimensions of the hypercube.

FIGS. 3A and 3B illustrate diagrams of a hardware configuration of an arithmetic processing circuit. The arithmetic processing circuit 20 includes parallel arithmetic circuits 21, arithmetic control circuits 22, memory control circuits 23, and memories 24. The arithmetic processing circuit 20 further includes a direct memory access (DMA) engine 25, a PCI control circuit 26, a job managing circuit 27, a network configuration managing circuit 28, a network control circuit 29, a communication buffer 30, and an interconnect circuit 31. Here, while FIGS. 3A and 3B include four parallel arithmetic circuits 21, four arithmetic control circuits 22, four memory control circuits 23, and four memories 24, the numbers thereof are not particularly limited.

The PCI control circuit 26 controls communication using PCI between the DMA engine 25, the job managing circuit 27, and the network configuration managing circuit 28 and the PCI switch 3.

The job managing circuit 27 receives input of an operation instruction for performing a job transmitted from the CPU 2 via the PCI switch 3 and the PCI control circuit 26. The job managing circuit 27 handles the obtained operation instruction as a queue. The job managing circuit 27 outputs the operation instruction to the arithmetic control circuits 22. Thereafter, the job managing circuit 27 receives input of operation results from the arithmetic control circuits 22. Then, the job managing circuit 27 integrates the operation results obtained from each arithmetic control circuit 22, and thereby obtains a result of the job. The job managing circuit 27 thereafter outputs the result of the job to the CPU 2 via the PCI control circuit 26 and the PCI switch 3 or the like.

The arithmetic control circuits 22 obtain data from the memories 24 via the memory control circuits 23 according to the operation instruction. Then, the arithmetic control circuits 22 output the obtained data and the operation instruction to the parallel arithmetic circuits 21. Thereafter, the arithmetic control circuits 22 obtain operation results from the parallel arithmetic circuits 21. Next, the arithmetic control circuits 22 output the operation results to the job managing circuit 27.

The parallel arithmetic circuits 21 receive input of the data and the operation instruction from the arithmetic control circuits 22. Then, the parallel arithmetic circuits 21 perform specified operation using the obtained data. Thereafter, the parallel arithmetic circuits 21 output operation results to the arithmetic control circuits 22.

The network configuration managing circuit 28 obtains a connection table within the system board 1 from a device driver executed by the CPU 2. The connection table includes a state of interconnect coupling between the arithmetic processing circuits 20. The network configuration managing circuit 28 outputs the obtained connection table to the network control circuit 29.

The network control circuit 29 is coupled to the communication buffer 30 and the interconnect circuit 31. The network control circuit 29 controls communication with another arithmetic processing circuit 20 via the interconnect by using the communication buffer 30 and the interconnect circuit 31. For example, the network control circuit 29 writes data to be transmitted to the communication buffer 30. Then, the network control circuit 29 transmits the data to the other arithmetic processing circuit 20 by instructing the interconnect circuit 31 to transmit the data stored in the communication buffer 30.

The network control circuit 29 obtains the connection table of the system board 1 from the network configuration managing circuit 28. Further, the network control circuit 29 obtains, from the job managing circuit 27, information related to communication between the arithmetic processing circuits 20, the communication being performed in the job being performed. Then, the network control circuit 29 determines a communication destination and data to be transmitted by using the connection table according to the obtained information related to the communication. Next, the network control circuit 29 outputs a request to obtain the data determined for transmission to the memory control circuits 23. Thereafter, the network control circuit 29 receives input of the data in response to the obtainment request from the memory control circuits 23. Then, the network control circuit 29 transmits the obtained data to the determined communication destination.

In addition, the network control circuit 29 receives a notification of reception of data from the interconnect circuit 31. Then, the network control circuit 29 obtains the data received from another arithmetic processing circuit 20 and stored in the communication buffer 30. Then, the network control circuit 29 outputs a writing instruction to the memory control circuits 23 together with the obtained data.

The communication between the arithmetic processing circuits 20, the communication being performed by the network control circuit 29, includes the processing of Allreduce. The processing of Allreduce via the interconnect by the network control circuit 29 will be described later in detail.

The communication buffer 30 is a temporary storage area in communication via the interconnect between the arithmetic processing circuits 20. The communication buffer 30 stores data to be transmitted to other arithmetic processing circuits 20. In addition, the communication buffer 30 stores data received from the other arithmetic processing circuits 20.

The interconnect circuit 31 couples to the interconnect circuits 31 of the other arithmetic processing circuits 20 by the interconnect. The interconnect circuit 31 performs communication via the interconnect with the other arithmetic processing circuits 20. The interconnect circuit 31 receives a data transmission instruction from the network control circuit 29, and reads data from the communication buffer 30. Then, the interconnect circuit 31 transmits the read data to an arithmetic processing circuit 20 as a communication destination specified from the network control circuit 29. In addition, when the interconnect circuit 31 receives data from another arithmetic processing circuit 20, the interconnect circuit 31 stores the received data in the communication buffer 30. Further, the interconnect circuit 31 notifies the reception of the data to the network control circuit 29.

The DMA engine 25 controls access to the memories 24 of another arithmetic processing circuit 20 coupled by a PCI bus without the intervention of the CPU 2 or the like. The DMA engine 25 receives input of information related to communication using PCI in the job from the job managing circuit 27. Then, according to the information related to the communication using PCI in the job, the DMA engine 25 instructs the memory control circuits 23 to read data. Thereafter, the DMA engine 25 receives, from the memory control circuits 23, input of the read data from the memories 24. Then, the DMA engine 25 transmits the obtained data to the memories 24 of the arithmetic processing circuit 20 as a transmission destination via the PCI control circuit 26 and the PCI switch 3.

In addition, the DMA engine 25 receives, from the PCI control circuit 26, data transmitted from the DMA engine 25 of another arithmetic processing circuit 20 by DMA. Then, the DMA engine 25 instructs the memory control circuits 23 to write the received data.

The memory control circuits 23 control reading and writing of data from and to the memories 24 coupled thereto. The memory control circuits 23 read data from the memories 24 according to a reading instruction from the DMA engine 25, and output the read data to the DMA engine 25. In addition, the memory control circuits 23 write data to the memories 24 according to a writing instruction from the DMA engine 25.

In addition, the memory control circuits 23 read data from the memories 24 according to a reading instruction from the network control circuit 29, and output the read data to the network control circuit 29. In addition, the memory control circuits 23 write data to the memories 24 according to a writing instruction from the network control circuit 29.

Description will next be made of details of the processing of Allreduce by the network control circuit 29. The information processing system according to the present embodiment performs the processing of Allreduce by performing a Halving operation and a Doubling operation. In the following, the Halving operation will be referred to simply as “Halving,” and the Doubling operation will be referred to simply as “Doubling.” Further, the processing of Allreduce will be referred to simply as “Allreduce.”

FIGS. 4A and 4B illustrate block diagrams of a network control circuit. As illustrated in FIGS. 4A and 4B, the network control circuit 29 includes a process identification (ID) allocating circuit 201, a thread generating circuit 202, a centralized control circuit 203, a transmission and reception data size calculating circuit 204, a target data determining circuit 205, a destination determining circuit 206, and a data transmitting circuit 207. The network control circuit 29 further includes a data receiving circuit 208 and a synchronization processing circuit 210.

Here, an arithmetic circuit possessed by the network control circuit 29 implements functions of the process ID allocating circuit 201, the thread generating circuit 202, the centralized control circuit 203, the transmission and reception data size calculating circuit 204, and the target data determining circuit 205 illustrated in FIGS. 4A and 4B. In addition, the arithmetic circuit implements functions of the destination determining circuit 206, the data transmitting circuit 207, the data receiving circuit 208, and the synchronization processing circuit 210.

In addition, as illustrated in FIG. 3B, the network control circuit 29 actually communicates with the other arithmetic processing circuits 20 via the communication buffer 30 and the interconnect circuit 31. However, for the convenience of description, the description will be made with the communication buffer 30 and the interconnect circuit 31 omitted. In the following description, an arithmetic processing circuit 20 to be described will be referred to as an own arithmetic processing circuit to be distinguished from the other arithmetic processing circuits 20.

The process ID allocating circuit 201 obtains the connection table from the network configuration managing circuit 28. Then, using the connection table, the process ID allocating circuit 201 allocates process IDs numbered from zero up to 2^(n)−1 to the own arithmetic processing circuit and the other arithmetic processing circuits 20, the other arithmetic processing circuits 20 being coupled so as to form a hypercube together with the own arithmetic processing circuit. “n” in this case is the dimensions of the hypercube. The process IDs correspond to an example of “identification number information” of the arithmetic processing circuits.

Here, the process ID allocating circuit 201 allocates the process IDs by a rule similar to that of the indexes allocated to the respective nodes illustrated in FIG. 2. For example, the arithmetic processing circuits 20 according to the present embodiment are coupled so as to form a two-dimensional hypercube, and the process ID allocating circuit 201 allocates 00, 01, 10, and 11 as process IDs to the respective arithmetic processing circuits 20. In this case, the process ID allocating circuit 201 allocates process IDs different from each other in the value of one bit to the adjacent arithmetic processing circuits 20 coupled by the interconnect. FIG. 5 is a diagram illustrating an outline of a flow of processing in Halving. The process ID allocating circuit 201 allocates the process IDs illustrated in FIG. 5 to the respective arithmetic processing circuits 20. Here, eight boxes arranged in a vertical direction facing a paper plane in FIG. 5 represent blocks, and a set of eight blocks corresponds to a data string possessed by one process. Then, numbers within respective blocks in FIG. 5 represent data strings calculated by the respective processes. Here, in FIG. 5, data strings corresponding to respective blocks included in one process are represented by a same number. This indicates that the process performing operation is the same. In actuality, data strings corresponding to respective blocks have respective different values.

The process ID allocating circuit 201 thereafter outputs information related to the process IDs allocated to the respective arithmetic processing circuits 20 to the centralized control circuit 203. In addition, the process ID allocating circuit 201 notifies the information related to the process IDs allocated to the respective arithmetic processing circuits 20 to the arithmetic processing circuits 20 as process ID allocation targets. Here, in a case of an arithmetic processing circuit 20 receiving the notification of the process IDs from another arithmetic processing circuit 20, the process ID allocating circuit 201 receives the notification of the process IDs allocated to the respective arithmetic processing circuits 20 from the other arithmetic processing circuit 20. Then, the process ID allocating circuit 201 outputs the obtained information related to the process IDs allocated to the respective arithmetic processing circuits 20 to the thread generating circuit 202.

The thread generating circuit 202 obtains, from the centralized control circuit 203, the process IDs allocated to the respective arithmetic processing circuits 20 by the process ID allocating circuit 201. Further, the thread generating circuit 202 receives, from the centralized control circuit 203, input of a process to be processed by the arithmetic processing circuit 20 in which the thread generating circuit 202 itself is included. Then, when an instruction to perform the processing of Allreduce is given in the obtained process, the thread generating circuit 202 performs the following processing.

The thread generating circuit 202 obtains the number of dimensions of the hypercube including the own arithmetic processing circuit. In the present embodiment, the thread generating circuit 202 obtains two as the number of dimensions of the hypercube in which the own arithmetic processing circuit is included. Then, the thread generating circuit 202 divides a data string calculated by the obtained process into n×2^(n) blocks, where “n” is the number of dimensions of the hypercube. For example, in the present embodiment, the thread generating circuit 202 divides the data string of the process into eight blocks, as illustrated in FIG. 5.

Further, supposing that the hypercube has n dimensions, the thread generating circuit 202 generates n threads by separating the generated blocks into sets of 2^(n) blocks each. For example, in the present embodiment, the thread generating circuit 202 generates two threads, namely the threads Th1 and Th2 illustrated in FIG. 5. In addition, the thread generating circuit 202 assigns a cyclic ID of n bits to each thread. For example, the thread generating circuit 202 generates n cyclic IDs of n bits, in each of which the value of an sth (1≤s≤n) bit of the n bits is one and the value of the other bits is zero, and assigns the cyclic IDs to the respective threads. In the present embodiment, the thread generating circuit 202 assigns “01” as a cyclic ID to the thread Th1, and assigns “10” as a cyclic ID to the thread Th2. Thereafter, the thread generating circuit 202 outputs information related to the generated threads Th1 and Th2 to the centralized control circuit 203 together with the cyclic IDs assigned to the respective threads Th1 and Th2. The thread generating circuit 202 corresponds to an example of a “dividing circuit.” A thread corresponds to an example of a “group.”
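The division into blocks and threads described above can be sketched as follows. This is only an illustrative sketch: the function name `make_threads`, the equal block size, and the assumption that consecutive blocks are grouped into one thread (the first 2^(n) blocks into the thread Th1, the next into the thread Th2) are not stated in the text.

```python
def make_threads(data, n):
    """Split `data` into n * 2**n blocks and group them into n threads.

    Returns a list of (cyclic_id, blocks) pairs. The s-th thread
    (1 <= s <= n) gets an n-bit cyclic ID whose s-th bit, counted from
    the least significant bit, is one. Grouping consecutive blocks into
    one thread is an assumption of this sketch.
    """
    num_blocks = n * 2 ** n
    if len(data) % num_blocks != 0:
        raise ValueError("data length must be a multiple of n * 2**n")
    size = len(data) // num_blocks
    blocks = [data[i * size:(i + 1) * size] for i in range(num_blocks)]
    per_thread = 2 ** n
    threads = []
    for s in range(1, n + 1):
        cyclic_id = 1 << (s - 1)
        threads.append((cyclic_id, blocks[(s - 1) * per_thread:s * per_thread]))
    return threads


# Two-dimensional hypercube: 8 blocks, 2 threads with cyclic IDs 01 and 10.
for cyclic_id, blocks in make_threads(list(range(16)), 2):
    print(format(cyclic_id, "02b"), blocks)
```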

The centralized control circuit 203 receives, from the thread generating circuit 202, input of the process IDs assigned to the respective arithmetic processing circuits 20 including the own arithmetic processing circuit and the data string included in each block. Then, the centralized control circuit 203 outputs the process IDs assigned to the respective arithmetic processing circuits 20 including the own arithmetic processing circuit to the thread generating circuit 202. Thereafter, the centralized control circuit 203 receives input of information related to the threads Th1 and Th2 and the cyclic IDs assigned to the threads Th1 and Th2 from the thread generating circuit 202.

Then, the centralized control circuit 203 decides to perform Halving when the processing of Allreduce is started. Then, the centralized control circuit 203 outputs, to the transmission and reception data size calculating circuit 204, the process ID of the own arithmetic processing circuit, the information related to the threads Th1 and Th2, the cyclic IDs assigned to the threads Th1 and Th2, and the data string of each block. The centralized control circuit 203 outputs, to the destination determining circuit 206, the information related to the threads Th1 and Th2, the cyclic IDs assigned to the threads Th1 and Th2, and the process ID of the own arithmetic processing circuit. Then, the centralized control circuit 203 instructs the transmission and reception data size calculating circuit 204 and the destination determining circuit 206 to perform Halving.

The centralized control circuit 203 thereafter receives a notification of completion of a communication from the memory control circuits 23. Then, in a case where the centralized control circuit 203 gave an instruction to perform Halving in the previous communication, the centralized control circuit 203 determines whether or not Halving has been performed the number of times equal to the number of dimensions of the hypercube including the arithmetic processing circuits 20. In the present embodiment, the configuration of a two-dimensional hypercube is adopted. The centralized control circuit 203 therefore determines whether or not Halving has been performed twice.

In the case of the configuration of a two-dimensional hypercube, for example, data strings are transmitted from the respective arithmetic processing circuits 20 in a first communication for the thread Th1, as indicated by solid line arrows in the first communication in FIG. 5. For example, the arithmetic processing circuit 20 having the process ID of 00 and the arithmetic processing circuit 20 having the process ID of 01 exchange data, and the arithmetic processing circuit 20 having the process ID of 10 and the arithmetic processing circuit 20 having the process ID of 11 exchange data. In addition, in the first communication for the thread Th2, data strings are transmitted as indicated by broken line arrows in the first communication in FIG. 5. For example, the arithmetic processing circuit 20 having the process ID of 00 and the arithmetic processing circuit 20 having the process ID of 10 exchange data, and the arithmetic processing circuit 20 having the process ID of 01 and the arithmetic processing circuit 20 having the process ID of 11 exchange data. In addition, arrows of alternate long and short dashed lines in FIG. 5 indicate changes in the data strings possessed by the respective arithmetic processing circuits 20. The transmission data and the determination of transmission destinations in this Halving will be described later in detail.

When Halving has not been performed the number of times equal to the number of dimensions of the hypercube including the arithmetic processing circuits 20, the centralized control circuit 203 instructs the transmission and reception data size calculating circuit 204 and the destination determining circuit 206 to perform a next Halving.

In the case of the configuration of a two-dimensional hypercube, for example, data strings are transmitted from the respective arithmetic processing circuits 20 in a second communication for the thread Th1, as indicated by solid line arrows in the second communication in FIG. 5. For example, the arithmetic processing circuit 20 having the process ID of 00 and the arithmetic processing circuit 20 having the process ID of 10 exchange data, and the arithmetic processing circuit 20 having the process ID of 01 and the arithmetic processing circuit 20 having the process ID of 11 exchange data. In addition, in the second communication for the thread Th2, data strings are transmitted as indicated by broken line arrows in the second communication in FIG. 5. For example, the arithmetic processing circuit 20 having the process ID of 00 and the arithmetic processing circuit 20 having the process ID of 01 exchange data, and the arithmetic processing circuit 20 having the process ID of 10 and the arithmetic processing circuit 20 having the process ID of 11 exchange data. Each of the arithmetic processing circuits 20 thereby retains a data string obtained by integrating the data strings of different blocks in the thread Th1. In addition, each of the arithmetic processing circuits 20 retains a data string obtained by integrating the data strings of different blocks in the thread Th2.
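The exchange pairs listed for the first and second communications of Halving are consistent with taking the exclusive disjunction (XOR) of the process ID and the cyclic ID, with the cyclic ID rotated left by one bit for each subsequent communication. The following sketch reproduces the pairs under that reading. The rotation rule is inferred from the listed pairs rather than quoted from the text, and the names `rotate_left` and `halving_partner` are hypothetical.

```python
def rotate_left(value, shift, n):
    """Rotate the n-bit integer `value` left by `shift` bit positions."""
    shift %= n
    mask = (1 << n) - 1
    return ((value << shift) | (value >> (n - shift))) & mask


def halving_partner(process_id, cyclic_id, step, n):
    """Partner of `process_id` in the step-th (1-based) Halving communication.

    Assumption of this sketch: the cyclic ID is rotated left by one bit
    per communication and then XORed with the process ID.
    """
    return process_id ^ rotate_left(cyclic_id, step - 1, n)


n = 2
for step in (1, 2):
    for name, cyclic_id in (("Th1", 0b01), ("Th2", 0b10)):
        pairs = sorted({tuple(sorted((p, halving_partner(p, cyclic_id, step, n))))
                        for p in range(2 ** n)})
        print(step, name, [(format(a, "02b"), format(b, "02b")) for a, b in pairs])
```

Because the two cyclic IDs have their single one-bit in different positions, the threads Th1 and Th2 use different hypercube links in the same communication under this reading.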

When Halving has been performed the number of times equal to the number of dimensions of the hypercube, on the other hand, the centralized control circuit 203 determines whether or not to perform the processing of Allreduce with arithmetic processing circuits 20 not included in the hypercube. In the present embodiment, as illustrated in FIGS. 1A and 1B, the processing of Allreduce is performed also with the arithmetic processing circuits 20 included in the different hypercube. Accordingly, the centralized control circuit 203 decides to perform the processing of Allreduce in an external topology. The centralized control circuit 203 then requests an outside coupling processing performing circuit 209 to perform the processing of Allreduce in the external topology. Thereafter, the centralized control circuit 203 receives a notification of completion of the processing of Allreduce in the external topology from the outside coupling processing performing circuit 209.

FIG. 6 is a diagram illustrating an outline of a flow of processing up to Allreduce with an external topology. As an example, the processing of Allreduce in the external topology is performed by a method indicated by thick line arrows in FIG. 6. As an example, arithmetic processing circuits 20 assigned a same process ID between the system boards 1 communicate with each other, and transfer data strings to each other. Each of the arithmetic processing circuits 20 may thereby obtain a data string combined with a result of Halving in the external topology, as indicated by a processing result 250.

When the centralized control circuit 203 receives a notification of completion of the processing of Allreduce in the external topology from the outside coupling processing performing circuit 209, the centralized control circuit 203 instructs the transmission and reception data size calculating circuit 204 and the destination determining circuit 206 to perform Doubling. The centralized control circuit 203 thereafter receives a notification of completion of a communication from the memory control circuits 23. Then, in a case where the centralized control circuit 203 gave an instruction to perform Doubling in the previous communication, the centralized control circuit 203 determines whether or not Doubling has been performed the number of times equal to the number of dimensions of the hypercube including the arithmetic processing circuits 20. In the present embodiment, the configuration of a two-dimensional hypercube is adopted. The centralized control circuit 203 therefore determines whether or not Doubling has been performed twice.

FIG. 7 is a diagram illustrating an outline of a flow of processing in Doubling. In the case of the configuration of a two-dimensional hypercube, for example, data strings are transmitted from the respective arithmetic processing circuits 20 in a first communication of Doubling for the thread Th1, as indicated by solid line arrows in a fourth communication in FIG. 7. For example, the arithmetic processing circuit 20 having the process ID of 00 and the arithmetic processing circuit 20 having the process ID of 10 exchange data, and the arithmetic processing circuit 20 having the process ID of 01 and the arithmetic processing circuit 20 having the process ID of 11 exchange data. In addition, in the first communication of Doubling for the thread Th2, data strings are transmitted as indicated by broken line arrows in the fourth communication in FIG. 7. For example, the arithmetic processing circuit 20 having the process ID of 00 and the arithmetic processing circuit 20 having the process ID of 01 exchange data, and the arithmetic processing circuit 20 having the process ID of 10 and the arithmetic processing circuit 20 having the process ID of 11 exchange data.

Further, in the case of the configuration of a two-dimensional hypercube, data strings are transmitted from the respective arithmetic processing circuits 20 in a second communication of Doubling for the thread Th1, as indicated by solid line arrows in a fifth communication in FIG. 7. For example, the arithmetic processing circuit 20 having the process ID of 00 and the arithmetic processing circuit 20 having the process ID of 01 exchange data, and the arithmetic processing circuit 20 having the process ID of 10 and the arithmetic processing circuit 20 having the process ID of 11 exchange data. In addition, in the second communication of Doubling for the thread Th2, data strings are transmitted as indicated by broken line arrows in the fifth communication in FIG. 7. For example, the arithmetic processing circuit 20 having the process ID of 00 and the arithmetic processing circuit 20 having the process ID of 10 exchange data, and the arithmetic processing circuit 20 having the process ID of 01 and the arithmetic processing circuit 20 having the process ID of 11 exchange data. Consequently, as illustrated in FIG. 7, after the fifth communication, all of the arithmetic processing circuits 20 included in the two-dimensional hypercube have data strings of a same value in each block. The transmission data and the determination of transmission destinations in this Doubling will be described later in detail.
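The Doubling pairs listed above are the Halving pairs visited in the opposite order: the first communication of Doubling uses the pairs of the last communication of Halving, and so on. The sketch below expresses this for an n-dimensional hypercube. The function name `doubling_partner` and the generalization beyond the two-dimensional example are assumptions of this sketch, not statements from the text.

```python
def doubling_partner(process_id, cyclic_id, step, n):
    """Partner of `process_id` in the step-th (1-based) Doubling communication.

    Assumption of this sketch: Doubling walks through the rotated
    cyclic IDs of Halving in reverse, so step 1 of Doubling uses the
    rotation that Halving used in its n-th communication.
    """
    shift = (n - step) % n
    mask = (1 << n) - 1
    rotated = ((cyclic_id << shift) | (cyclic_id >> (n - shift))) & mask
    return process_id ^ rotated


n = 2
for step in (1, 2):
    for name, cyclic_id in (("Th1", 0b01), ("Th2", 0b10)):
        partners = {format(p, "02b"): format(doubling_partner(p, cyclic_id, step, n), "02b")
                    for p in range(2 ** n)}
        print("Doubling", step, name, partners)
```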

Here, unlike the present embodiment, there is a case where the processing of Allreduce is completed between the arithmetic processing circuits 20 included in the hypercube including the own arithmetic processing circuit. In that case, the centralized control circuit 203 immediately proceeds to Doubling after completion of Halving.

When Doubling has not been performed the number of times equal to the number of dimensions of the hypercube including the arithmetic processing circuits 20, the centralized control circuit 203 instructs the transmission and reception data size calculating circuit 204 and the destination determining circuit 206 to perform a next Doubling.

When Doubling has been performed the number of times equal to the number of dimensions of the hypercube including the arithmetic processing circuits 20, on the other hand, the centralized control circuit 203 instructs the synchronization processing circuit 210 to perform synchronization processing in an internal topology. When the centralized control circuit 203 thereafter receives a notification of completion of synchronization in the internal topology from the synchronization processing circuit 210, the centralized control circuit 203 instructs the synchronization processing circuit 210 to perform synchronization processing in the external topology. When the centralized control circuit 203 thereafter receives a notification of completion of synchronization in the external topology from the synchronization processing circuit 210, the centralized control circuit 203 determines that the processing of Allreduce is completed. Then, the centralized control circuit 203 notifies the job managing circuit 27 of completion of the processing of Allreduce.
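The order of operations directed by the centralized control circuit 203 can be summarized as a list of phases. The sketch below is illustrative only: the function name `allreduce_phases`, the phase labels, and the `external` flag are hypothetical, and how the synchronization steps change when no external topology is involved is not stated in the text, so both are kept unchanged here.

```python
def allreduce_phases(n, external=True):
    """Return the ordered phases of Allreduce for an n-dimensional hypercube.

    Each phase is a (name, communication_number) pair. Hypothetical
    sketch: n Halving communications, an optional Allreduce in the
    external topology, n Doubling communications, then synchronization
    in the internal and external topologies.
    """
    phases = [("halving", k) for k in range(1, n + 1)]
    if external:
        phases.append(("external_allreduce", None))
    phases += [("doubling", k) for k in range(1, n + 1)]
    # Synchronization as in the embodiment; the text does not say how
    # this changes when external is False.
    phases += [("sync_internal", None), ("sync_external", None)]
    return phases


for phase in allreduce_phases(2):
    print(phase)
```

For n = 2 with the external step, the two Doubling communications are the fourth and fifth entries, which matches the numbering of the fourth and fifth communications in FIG. 7.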

The transmission and reception data size calculating circuit 204 receives, from the centralized control circuit 203, input of the information related to the threads Th1 and Th2, the cyclic IDs assigned to the threads Th1 and Th2, and the data string of each block. Then, the transmission and reception data size calculating circuit 204 calculates a data size for performing transmission and reception.

In a case where the arithmetic processing circuits 20 are coupled so as to form an n-dimensional hypercube, when the transmission and reception data size calculating circuit 204 is instructed to perform Halving, the transmission and reception data size calculating circuit 204 sets 2^(n−1) blocks of data strings as a data size for transmission and reception in a first communication.

Thereafter, each time the number of communications increases, the transmission and reception data size calculating circuit 204 sets ½ of the previous number of blocks as a data size for transmission and reception. For example, in the second communication, the transmission and reception data size calculating circuit 204 sets 2^(n−1)×½ blocks as a data size for transmission and reception. In the present embodiment, the transmission and reception data size calculating circuit 204 sets two blocks as a data size for transmission and reception in the first communication. Further, the transmission and reception data size calculating circuit 204 sets one block as a data size for transmission and reception in the second communication. Thereafter, the transmission and reception data size calculating circuit 204 outputs the determined data sizes for transmission and reception to the target data determining circuit 205. In addition, the transmission and reception data size calculating circuit 204 outputs, to the target data determining circuit 205, the information related to the threads Th1 and Th2, the cyclic IDs assigned to the threads Th1 and Th2, and the data string of each block.

When the transmission and reception data size calculating circuit 204 is instructed to perform Doubling, on the other hand, the transmission and reception data size calculating circuit 204 sets one block of a data string as a data size for transmission and reception in the first communication. Thereafter, each time the number of communications increases, the transmission and reception data size calculating circuit 204 sets double the previous number of blocks as a data size for transmission and reception. For example, in the second communication, the transmission and reception data size calculating circuit 204 sets two blocks as a data size for transmission and reception. In the present embodiment, the transmission and reception data size calculating circuit 204 sets one block as a data size for transmission and reception in the first communication. Further, the transmission and reception data size calculating circuit 204 sets two blocks as a data size for transmission and reception in the second communication. Thereafter, the transmission and reception data size calculating circuit 204 outputs the determined data sizes for transmission and reception to the target data determining circuit 205. In addition, the transmission and reception data size calculating circuit 204 outputs the information related to the threads Th1 and Th2, the cyclic IDs assigned to the threads Th1 and Th2, and the data string of each block to the target data determining circuit 205.
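The data sizes per thread can be written compactly. The embodiment's values for n = 2 (two blocks and then one block in Halving, one block and then two blocks in Doubling) correspond to 2^(n−1) blocks in the first communication of Halving, halved in each later communication, and to one block in the first communication of Doubling, doubled in each later communication. The function names below are hypothetical.

```python
def halving_sizes(n):
    """Blocks per thread sent in Halving communications 1..n."""
    return [2 ** (n - k) for k in range(1, n + 1)]


def doubling_sizes(n):
    """Blocks per thread sent in Doubling communications 1..n."""
    return [2 ** (k - 1) for k in range(1, n + 1)]


print(halving_sizes(2), doubling_sizes(2))
print(halving_sizes(3), doubling_sizes(3))
```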

The target data determining circuit 205 receives, from the transmission and reception data size calculating circuit 204, input of the data sizes for transmission and reception, the information related to the threads Th1 and Th2, the cyclic IDs assigned to the threads Th1 and Th2, and the data string of each block. The target data determining circuit 205 fixes the arrangement order of each block. The target data determining circuit 205 maintains the arrangement order of the blocks until the processing of Allreduce is completed.

FIG. 8 is a diagram of assistance in explaining a method of determining data blocks for transmission and reception on a thread Th1 side. As an example, the target data determining circuit 205 fixes the arrangement order of blocks in the order illustrated in FIG. 8. In the following, relation of the blocks whose arrangement order is fixed as in FIG. 8 will be described using a vertical direction facing a paper plane. The target data determining circuit 205 determines data strings for transmission and reception for the respective threads Th1 and Th2. The following description will be made of details of a method of determining data for transmission and reception by the target data determining circuit 205.

When the target data determining circuit 205 is instructed to perform Halving, the target data determining circuit 205 generates target determination groups by dividing the threads Th1 and Th2 by the data size for transmission and reception. The target data determining circuit 205 divides the threads Th1 and Th2 into two in the first communication. In the second or subsequent communications, the target data determining circuit 205 divides the threads Th1 and Th2 by a number twice that of a previous communication as the number of communications increases. For example, in the case of the second communication, the target data determining circuit 205 divides the threads Th1 and Th2 into four.

In the present embodiment, when the process ID of the arithmetic processing circuit 20 is 00, the target data determining circuit 205 divides the thread Th1 into target determination groups 301 and 302 in FIG. 8 in the first communication. In addition, in the second communication, the target data determining circuit 205 divides the thread Th1 into four target determination groups including target determination groups 305 and 306.

Similarly, when the process ID of the arithmetic processing circuit 20 is 01, the target data determining circuit 205 divides the thread Th1 into target determination groups 311 and 312 in FIG. 8 in the first communication. In addition, in the second communication, the target data determining circuit 205 divides the thread Th1 into four target determination groups including target determination groups 315 and 316. In the following, the arrangement of the target determination groups will also be described in the vertical direction facing the paper plane so as to correspond to the arrangement of blocks.

Next, the target data determining circuit 205 assigns transmission and reception target indexes to the generated target determination groups. Here, in a kth communication, the target data determining circuit 205 assigns each target determination group a binary number of k bits (k digits) as a transmission and reception target index such that the binary number sequentially increases from a top to a bottom. In the case of the first communication, the target data determining circuit 205 assigns the target determination groups a binary number represented by one bit as a transmission and reception target index such that the binary number sequentially increases. In addition, in the case of the second communication, the target data determining circuit 205 assigns the target determination groups a binary number represented by 2 bits as a transmission and reception target index such that the binary number sequentially increases.

In the present embodiment, in the first communication, when the process ID of the arithmetic processing circuit 20 is 00, the target data determining circuit 205 assigns “0” as a transmission and reception target index to the target determination group 301, as illustrated in FIG. 8. In addition, the target data determining circuit 205 assigns “1” as a transmission and reception target index to the target determination group 302.

Similarly, when the process ID of the arithmetic processing circuit 20 is 01, the target data determining circuit 205 assigns “0” as a transmission and reception target index to the target determination group 311 in the first communication. In addition, the target data determining circuit 205 assigns “1” as a transmission and reception target index to the target determination group 312.

In addition, in the second communication, when the process ID of the arithmetic processing circuit 20 is 00, the target data determining circuit 205 assigns “00” as a transmission and reception target index to the target determination group 305. In addition, the target data determining circuit 205 assigns “01” as a transmission and reception target index to the target determination group 306. Further, the target data determining circuit 205 assigns “10” and “11” as respective transmission and reception target indexes to the two other target determination groups following below.

Similarly, when the process ID of the arithmetic processing circuit 20 is 01, the target data determining circuit 205 assigns “00” as a transmission and reception target index to the topmost target determination group in the second communication. In addition, the target data determining circuit 205 assigns “01” as a transmission and reception target index to the target determination group below the topmost target determination group. Further, the target data determining circuit 205 assigns “10” as a transmission and reception target index to the target determination group 315, and assigns “11” as a transmission and reception target index to the target determination group 316.
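The grouping and indexing of one thread can be sketched as follows, with block positions numbered from 0 at the top. The function name `target_groups` and the zero-based block numbering are assumptions made for this illustration.

```python
def target_groups(num_blocks, k):
    """Divide a thread of `num_blocks` blocks for the k-th communication.

    Returns a list of (index, block_positions) pairs ordered from top
    to bottom, where `index` is the k-bit transmission and reception
    target index written as a binary string.
    """
    count = 2 ** k
    if num_blocks % count != 0:
        raise ValueError("num_blocks must be divisible by 2**k")
    size = num_blocks // count
    return [(format(g, f"0{k}b"), list(range(g * size, (g + 1) * size)))
            for g in range(count)]


# Thread of 2**2 = 4 blocks in the two-dimensional embodiment.
print(target_groups(4, 1))
print(target_groups(4, 2))
```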

Next, the target data determining circuit 205 determines an area to be set as a transmission target and an area to be set as a reception target from among transmission and reception target groups. Here, an area to be set as a reception target is an area of data to be set as a target of operation using received data.

For example, in the first communication, the target data determining circuit 205 sets, as a reception target area, a target determination group in which a first bit of the process ID and the transmission and reception target index are a same value. In addition, the target data determining circuit 205 sets, as a transmission target area, a target determination group in which a value obtained by inverting the first bit of the process ID and the transmission and reception target index are a same value.

For example, when the process ID is 00, the first bit is “0,” and therefore the target data determining circuit 205 sets the target determination group 301 having a transmission and reception target index of “0” as the reception target area. In addition, the target data determining circuit 205 sets the target determination group 302 having a transmission and reception target index of “1” as the transmission target area. In addition, when the process ID is 01, the first bit is “1,” and therefore the target data determining circuit 205 sets the target determination group 312 having a transmission and reception target index of “1” as the reception target area. In addition, the target data determining circuit 205 sets the target determination group 311 having a transmission and reception target index of “0” as the transmission target area.

Then, the target data determining circuit 205 sets, as reception target data, the data strings of blocks corresponding to the target determination group set as the reception target area. In addition, the target data determining circuit 205 sets, as transmission target data, the data strings of blocks corresponding to the target determination group set as the transmission target area.

For example, in the first communication, the target data determining circuit 205 of the arithmetic processing circuit 20 having the process ID of 00 sets, as the reception target data, the data strings of blocks included in an area 303 corresponding to the target determination group 301. In addition, the target data determining circuit 205 sets, as the transmission target data, the data strings of blocks included in an area 304 corresponding to the target determination group 302.

In addition, in the first communication, the target data determining circuit 205 of the arithmetic processing circuit 20 having the process ID of 01 sets, as the reception target data, the data strings of blocks included in an area 314 corresponding to the target determination group 312. In addition, the target data determining circuit 205 sets, as the transmission target data, the data strings of blocks included in an area 313 corresponding to the target determination group 311.

Next, in the second communication, the target data determining circuit 205 sets target determination groups in which the value of the first bit of the process ID and the value of a second bit of the transmission and reception target index are a same value as target determination groups for transmission and reception targets. In other words, the target determination groups included in the reception target area in the previous communication become the target determination groups for transmission and reception targets in the present communication. Further, the target data determining circuit 205 selects a target determination group in which the value of a second bit of the process ID and the value of a first bit of the transmission and reception target index are a same value as the reception target area from the target determination groups for transmission and reception targets. In addition, the target data determining circuit 205 sets, as the transmission target area, a target determination group in which a value obtained by inverting the second bit of the process ID and the value of the first bit of the transmission and reception target index are a same value.

For example, when the process ID is 00, the first bit is “0,” and therefore the target data determining circuit 205 sets the target determination groups 305 and 306 in which the second bit of the transmission and reception target index is “0” as the target determination groups for transmission and reception targets. Then, when the process ID is 00, the second bit is “0,” and therefore the target data determining circuit 205 sets the target determination group 305 in which the first bit of the transmission and reception target index is “0” as the reception target area. In addition, the target data determining circuit 205 sets the target determination group 306 in which the first bit of the transmission and reception target index is “1” as the transmission target area.

In addition, when the process ID is 01, the first bit is “1,” and therefore the target data determining circuit 205 sets the target determination groups 315 and 316 in which the second bit of the transmission and reception target index is “1” as the target determination groups for transmission and reception targets. Then, when the process ID is 01, the second bit is “0,” and therefore the target data determining circuit 205 sets the target determination group 315 in which the first bit of the transmission and reception target index is “0” as the reception target area. In addition, the target data determining circuit 205 sets the target determination group 316 in which the first bit of the transmission and reception target index is “1” as the transmission target area.

Then, the target data determining circuit 205 sets, as the reception target data, the data string of a block corresponding to the target determination group set as the reception target area. In addition, the target data determining circuit 205 sets, as the transmission target data, the data string of a block corresponding to the target determination group set as the transmission target area.

For example, in the second communication, the target data determining circuit 205 of the arithmetic processing circuit 20 having the process ID of 00 sets, as the reception target data, the data string of a block included in an area 307 corresponding to the target determination group 305. In addition, the target data determining circuit 205 sets, as the transmission target data, the data string of a block included in an area 308 corresponding to the target determination group 306. In addition, in the second communication, the target data determining circuit 205 of the arithmetic processing circuit 20 having the process ID of 01 sets, as the reception target data, the data string of a block included in an area 317 corresponding to the target determination group 315. In addition, the target data determining circuit 205 sets, as the transmission target data, the data string of a block included in an area 318 corresponding to the target determination group 316.

In the present embodiment, which represents the case of a two-dimensional hypercube, Halving is completed in the second communication. However, in a case where a coupling is made so as to form a hypercube of three dimensions or more, communication of Halving further continues. In a pth communication, for the thread Th1, the target data determining circuit 205 determines the transmission target area and the reception target area as follows. The target data determining circuit 205 sets, as the target determination groups for transmission and reception targets, target determination groups in which a sequence obtained by reversing the arrangement order of a value from the first bit to a (p−1)th bit of the process ID and a sequence obtained by arranging a value from the second bit to a pth bit of the transmission and reception target index coincide with each other. Then, the target data determining circuit 205 selects a target determination group in which the value of a pth bit of the process ID and the value of the first bit of the transmission and reception target index are a same value as the reception target area from the target determination groups for transmission and reception targets. In addition, the target data determining circuit 205 sets, as the transmission target area, a target determination group in which a value obtained by inverting the pth bit of the process ID and the value of the first bit of the transmission and reception target index are a same value.

For example, set as the reception target area is a target determination group in which a sequence obtained by reversing the arrangement order of a value from the first bit to the pth bit of the process ID and a sequence obtained by arranging a value from the first bit to the pth bit of the transmission and reception target index coincide with each other. In addition, set as the transmission target area is a target determination group in which a sequence obtained by inverting a least significant bit in the sequence obtained by reversing the arrangement order of the value from the first bit to the pth bit of the process ID and a sequence obtained by arranging a value from the pth bit to the first bit of the transmission and reception target index coincide with each other.
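The rule above for the thread Th1 may be read as a bit reversal of the lowest p bits of the process ID. The following is a minimal sketch of that reading, not a definitive implementation. The function names are hypothetical, process IDs are assumed to be held as integers, and the first bit is assumed to be the least significant bit.

```python
def reverse_low_bits(value: int, p: int) -> int:
    """Reverse the order of the lowest p bits of value (bit 1 = least significant)."""
    result = 0
    for _ in range(p):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def th1_halving_indexes(process_id: int, p: int) -> tuple:
    """Return (reception_index, transmission_index) for the pth Halving
    communication of thread Th1, as p-bit strings (hypothetical helper)."""
    reception = reverse_low_bits(process_id, p)
    # The transmission target differs only in the least significant bit.
    transmission = reception ^ 1
    return format(reception, f"0{p}b"), format(transmission, f"0{p}b")
```

Under these assumptions, the index bit that distinguishes the two halves chosen in the first communication ends up as the most significant bit of the index, so each later communication subdivides the area kept by the previous one.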

FIG. 9 is a diagram illustrating a whole of Halving for a thread Th1 by four arithmetic processing circuits included in a two-dimensional hypercube. In FIG. 9, blocks enclosed by solid lines are reception target data. In addition, blocks enclosed by broken lines are transmission target data. A right side facing a paper plane of FIG. 9 represents blocks corresponding to data transmitted and received in the respective arithmetic processing circuits 20, and a left side represents transmission target and reception target areas.

The target data determining circuits 205 of the respective arithmetic processing circuits 20 included in the hypercube select transmission target data and reception target data for the thread Th1 in each communication as illustrated in FIG. 9. Then, for the thread Th1, the target data determining circuits 205 of the respective arithmetic processing circuits 20 repeat transmitting the transmission target data illustrated in FIG. 9, and performing operation using received data and the reception target data illustrated in FIG. 9.

In addition, for the thread Th2, the target data determining circuit 205 generates target determination groups by dividing the thread Th2 by the data size for transmission and reception. The target data determining circuit 205 divides the thread Th2 into two in the first communication. In the second or subsequent communications, the target data determining circuit 205 divides the thread Th2 by a number twice that of a previous communication as the number of communications increases. For example, in the case of the second communication, the target data determining circuit 205 divides the thread Th2 into four. FIG. 10 is a diagram of assistance in explaining a method of determining data blocks for transmission and reception on a thread Th2 side in Halving.

For example, in the present embodiment, when the process ID of the arithmetic processing circuit 20 is 00, the target data determining circuit 205 divides the thread Th2 into target determination groups 331 and 332 in FIG. 10 in the first communication. In addition, in the second communication, the target data determining circuit 205 divides the thread Th2 into four target determination groups including target determination groups 335 and 336. Similarly, when the process ID of the arithmetic processing circuit 20 is 01, the target data determining circuit 205 divides the thread Th2 into target determination groups 341 and 342 in FIG. 10 in the first communication. In addition, in the second communication, the target data determining circuit 205 divides the thread Th2 into four target determination groups including target determination groups 345 and 346.

Next, as in the case of the thread Th1, the target data determining circuit 205 assigns transmission and reception target indexes to the generated target determination groups. In the present embodiment, in the first communication, when the process ID of the arithmetic processing circuit 20 is 00, the target data determining circuit 205 assigns “0” as a transmission and reception target index to the target determination group 331, as illustrated in FIG. 10. In addition, the target data determining circuit 205 assigns “1” as a transmission and reception target index to the target determination group 332.

Similarly, in the first communication, when the process ID of the arithmetic processing circuit 20 is 01, the target data determining circuit 205 assigns “0” as a transmission and reception target index to the target determination group 341, and assigns “1” as a transmission and reception target index to the target determination group 342.

In addition, in the second communication, when the process ID of the arithmetic processing circuit 20 is 00, the target data determining circuit 205 assigns “00” as a transmission and reception target index to the target determination group 335. In addition, the target data determining circuit 205 assigns “01” as a transmission and reception target index to the target determination group 336. In addition, the target data determining circuit 205 assigns “10” and “11” as respective transmission and reception target indexes to the two other target determination groups following below.

Similarly, when the process ID of the arithmetic processing circuit 20 is 01, the target data determining circuit 205 assigns “00” as a transmission and reception target index to the topmost target determination group in the second communication. In addition, the target data determining circuit 205 assigns “01” as a transmission and reception target index to the target determination group below the topmost target determination group. Further, the target data determining circuit 205 assigns “10” and “11” as respective transmission and reception target indexes to the two other target determination groups following below.
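The division and index assignment described above can be sketched as follows. This is only an illustration under stated assumptions: the thread is modeled as a Python list of blocks ordered from top to bottom, the number of blocks is assumed to be a multiple of the number of groups, and the function name is hypothetical.

```python
def target_determination_groups(blocks: list, p: int) -> dict:
    """Divide a thread into 2**p target determination groups for the pth
    Halving communication and label each with a p-bit index string."""
    count = 2 ** p  # two groups in the first communication, doubling each time
    if len(blocks) % count != 0:
        raise ValueError("block count must be a multiple of the group count")
    size = len(blocks) // count
    # The topmost group receives index 0, the next one index 1, and so on.
    return {
        format(i, f"0{p}b"): blocks[i * size:(i + 1) * size]
        for i in range(count)
    }
```

For a hypothetical thread of four blocks, the first communication yields the groups “0” and “1” with two blocks each, and the second communication yields “00”, “01”, “10”, and “11” with one block each.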

Next, the target data determining circuit 205 determines an area to be set as a transmission target and an area to be set as a reception target from among transmission and reception target groups. For example, the target data determining circuit 205 checks the position of a bit having a value of one in the cyclic ID of the thread Th2, and sets the number of the bit having the value of one as a reference for rearrangement. In the present embodiment, the target data determining circuit 205 checks that the value of the second bit is one, and the target data determining circuit 205 sets two as the reference for rearrangement.
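Finding the reference for rearrangement amounts to locating the single set bit of the cyclic ID. A minimal sketch follows, assuming the cyclic ID is held as an integer with exactly one bit set and that bits are numbered from one starting at the least significant bit. The function name is hypothetical.

```python
def rearrangement_reference(cyclic_id: int) -> int:
    """Return the 1-based position of the single set bit in the cyclic ID."""
    # Reject zero and values with more than one bit set.
    if cyclic_id <= 0 or cyclic_id & (cyclic_id - 1):
        raise ValueError("cyclic ID must have exactly one bit set")
    return cyclic_id.bit_length()
```

With the cyclic ID “10” of the thread Th2, this returns two, which agrees with the reference for rearrangement given in the present embodiment.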

In the first communication, the target data determining circuit 205 sets, as the reception target area, a target determination group in which the second bit of the process ID, the second bit being the reference for rearrangement, and the transmission and reception target index are the same value. In addition, the target data determining circuit 205 sets, as the transmission target area, a target determination group in which a value obtained by inverting the second bit of the process ID and the transmission and reception target index are the same value.

For example, when the process ID is 00, the second bit is “0,” and therefore the target data determining circuit 205 sets the target determination group 331 having a transmission and reception target index of “0” as the reception target area. In addition, the target data determining circuit 205 sets the target determination group 332 having a transmission and reception target index of “1” as the transmission target area. In addition, when the process ID is 01, the second bit is “0,” and therefore the target data determining circuit 205 sets the target determination group 341 having a transmission and reception target index of “0” as the reception target area. In addition, the target data determining circuit 205 sets the target determination group 342 having a transmission and reception target index of “1” as the transmission target area.

Then, the target data determining circuit 205 sets, as reception target data, the data strings of blocks corresponding to the target determination group set as the reception target area. In addition, the target data determining circuit 205 sets, as transmission target data, the data strings of blocks corresponding to the target determination group set as the transmission target area.

For example, in the first communication, the target data determining circuit 205 of the arithmetic processing circuit 20 having the process ID of 00 sets, as the reception target data, the data strings of blocks included in an area 333 corresponding to the target determination group 331. In addition, the target data determining circuit 205 sets, as the transmission target data, the data strings of blocks included in an area 334 corresponding to the target determination group 332.

In addition, in the first communication, the target data determining circuit 205 of the arithmetic processing circuit 20 having the process ID of 01 sets, as the reception target data, the data strings of blocks included in an area 343 corresponding to the target determination group 341. In addition, the target data determining circuit 205 sets, as the transmission target data, the data strings of blocks included in an area 344 corresponding to the target determination group 342. The data strings included in the blocks determined as the transmission target data by the target data determining circuits 205 in the first communication correspond to an example of “first data blocks.”

Next, in the second communication, the target data determining circuit 205 sets, as target determination groups for transmission and reception targets, target determination groups in which the value of the second bit of the process ID, the second bit being the reference for rearrangement, and the value of the second bit of the transmission and reception target index are the same value. For example, the target determination groups included in the reception target area in the previous communication become the target determination groups for transmission and reception targets in the present communication. Further, from the target determination groups for transmission and reception targets, the target data determining circuit 205 selects, as the reception target area, a target determination group in which the value of the first bit of the process ID and the value of the first bit of the transmission and reception target index are the same value. In addition, the target data determining circuit 205 sets, as the transmission target area, a target determination group in which a value obtained by inverting the first bit of the process ID and the value of the first bit of the transmission and reception target index are the same value.

For example, when the process ID is 00, the second bit is “0,” and therefore the target data determining circuit 205 sets the target determination groups 335 and 336 in which the second bit of the transmission and reception target index is “0” as the target determination groups for transmission and reception targets. Then, when the process ID is 00, the first bit is “0,” and therefore the target data determining circuit 205 sets the target determination group 335 in which the first bit of the transmission and reception target index is “0” as the reception target area. In addition, the target data determining circuit 205 sets the target determination group 336 in which the first bit of the transmission and reception target index is “1” as the transmission target area.

In addition, when the process ID is 01, the second bit is “0,” and therefore the target data determining circuit 205 sets the target determination groups 345 and 346 in which the second bit of the transmission and reception target index is “0” as the target determination groups for transmission and reception targets. Then, when the process ID is 01, the first bit is “1,” and therefore the target data determining circuit 205 sets the target determination group 346 in which the first bit of the transmission and reception target index is “1” as the reception target area. In addition, the target data determining circuit 205 sets the target determination group 345 in which the first bit of the transmission and reception target index is “0” as the transmission target area.

Then, the target data determining circuit 205 sets, as reception target data, the data string of a block corresponding to the target determination group set as the reception target area. In addition, the target data determining circuit 205 sets, as transmission target data, the data string of a block corresponding to the target determination group set as the transmission target area.

For example, in the second communication, the target data determining circuit 205 of the arithmetic processing circuit 20 having the process ID of 00 sets, as the reception target data, the data string of a block included in an area 337 corresponding to the target determination group 335. In addition, the target data determining circuit 205 sets, as the transmission target data, the data string of a block included in an area 338 corresponding to the target determination group 336. In addition, in the second communication, the target data determining circuit 205 of the arithmetic processing circuit 20 having the process ID of 01 sets, as the reception target data, the data string of a block included in an area 348 corresponding to the target determination group 346. In addition, the target data determining circuit 205 sets, as the transmission target data, the data string of a block included in an area 347 corresponding to the target determination group 345. The data strings included in the blocks determined as the transmission target data by the target data determining circuits 205 in the second communication correspond to an example of “second data blocks.”

In the present embodiment, which represents the case where the arithmetic processing circuits 20 are coupled so as to form a two-dimensional hypercube, Halving is completed in the second communication as in the case of the thread Th1. However, in a case of a hypercube of three dimensions or more, communication of Halving further continues. In a pth communication, for the thread Th2, the target data determining circuit 205 generates a sequence obtained by reversing the arrangement order of a value from the second bit to the pth bit of the process ID, the second bit being the reference for rearrangement. The target data determining circuit 205 sets, as the target determination groups for transmission and reception targets, target determination groups in which the generated sequence and a sequence obtained by arranging a value from the pth bit to the second bit of the transmission and reception target index coincide with each other. Then, from the target determination groups for transmission and reception targets, the target data determining circuit 205 selects, as the reception target area, a target determination group in which the value of the first bit of the process ID and the value of the first bit of the transmission and reception target index are the same value. In addition, the target data determining circuit 205 sets, as the transmission target area, a target determination group in which a value obtained by inverting the pth bit of the process ID and the value of the first bit of the transmission and reception target index are the same value.

The transmission target area and the reception target area in the pth communication may be expressed by using a rearrangement sequence of p bits in a state in which the value of the first bit immediately below the reference for rearrangement of the process ID is moved to a bit next to a pth bit, and the numbers of the bits are renumbered 1 to p. For example, set as the reception target area is a target determination group in which a sequence obtained by reversing the arrangement order of a value from the first bit to the pth bit of the rearrangement sequence and a sequence obtained by arranging a value from the first bit to the pth bit of the transmission and reception target index coincide with each other. In addition, set as the transmission target area is a target determination group in which a sequence obtained by inverting a least significant bit in the sequence obtained by reversing the arrangement order of the value from the first bit to the pth bit of the rearrangement sequence and the sequence obtained by arranging the value from the first bit to the pth bit of the transmission and reception target index coincide with each other.

Further, in a case of the configuration of an n-dimensional hypercube, the number of threads is n. The target data determining circuit 205 determines transmission target data and reception target data for each of the n threads. Accordingly, consideration will be given to a case of a thread Thq (1≤q≤n). The target data determining circuit 205 checks the position of a bit having a value of one in the cyclic ID of the thread Thq, and sets the number of the bit having the value of one as a reference for rearrangement. In this case, the target data determining circuit 205 checks that the value of a qth bit is one, and the target data determining circuit 205 sets q as the reference for rearrangement.

In this case, the rearrangement sequence is a sequence of p bits in a state in which the sequence of a value from the first bit of the process ID to a (q−1)th bit thereof immediately below the reference for rearrangement is moved next to a pth bit, for example, moved to an opposite side from a (p−1)th bit, and the numbers of the bits are renumbered 1 to p. Then, set as the reception target area is a target determination group in which the sequence obtained by reversing the arrangement order of the value from the first bit to the pth bit of the rearrangement sequence and the sequence obtained by arranging the value from the first bit to the pth bit of the transmission and reception target index coincide with each other. In addition, set as the transmission target area is a target determination group in which the sequence obtained by inverting the least significant bit in the sequence obtained by reversing the arrangement order of the value from the first bit to the pth bit of the rearrangement sequence and the sequence obtained by arranging the value from the first bit to the pth bit of the transmission and reception target index coincide with each other.
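One possible reading of the rearrangement sequence is a rotation of the process ID that brings the qth bit to the first position, after which the first p bits are taken. The sketch below follows that reading and is an assumption rather than the definitive procedure. The function names are hypothetical, process IDs are integers of n bits, and the first bit is the least significant bit.

```python
def rearrangement_sequence(process_id: int, n: int, q: int, p: int) -> list:
    """Return bits 1..p of the rearrangement sequence for thread Thq.

    Element k of the list (0-based) is bit ((q - 1 + k) mod n) + 1 of the
    process ID, so bits below the reference q wrap around to the end.
    """
    return [(process_id >> ((q - 1 + k) % n)) & 1 for k in range(p)]


def halving_indexes(process_id: int, n: int, q: int, p: int) -> tuple:
    """Return (reception_index, transmission_index) as p-bit strings for the
    pth Halving communication of thread Thq in an n-dimensional hypercube."""
    reception = 0
    # Reversing the order places bit 1 of the sequence at index bit p.
    for bit in rearrangement_sequence(process_id, n, q, p):
        reception = (reception << 1) | bit
    transmission = reception ^ 1  # invert the least significant bit
    return format(reception, f"0{p}b"), format(transmission, f"0{p}b")
```

For the two-dimensional example, this sketch gives reception index “01” and transmission index “00” for the thread Th2 at process ID 01 in the second communication, which corresponds to the target determination groups 346 and 345 described above.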

FIG. 11 is a diagram illustrating Halving for a thread Th2 by four arithmetic processing circuits included in a two-dimensional hypercube. In FIG. 11, blocks enclosed by solid lines are reception target data. In addition, blocks enclosed by broken lines are transmission target data. A right side facing a paper plane of FIG. 11 represents blocks corresponding to data transmitted and received in the respective arithmetic processing circuits 20, and a left side represents transmission target areas and reception target areas.

The target data determining circuits 205 of the respective arithmetic processing circuits 20 select transmission target data and reception target data for the thread Th2 in each communication as illustrated in FIG. 11. Then, for the thread Th2, the target data determining circuits 205 of the respective arithmetic processing circuits 20 repeat transmitting the transmission target data illustrated in FIG. 11, and performing operation using received data and the reception target data illustrated in FIG. 11.

When the target data determining circuit 205 is instructed to perform Doubling, on the other hand, the target data determining circuit 205 determines transmission and reception target data so as to trace an opposite procedure from that of Halving. For example, the target data determining circuit 205 determines transmission and reception target data by the following method.

Target determination groups are generated by dividing the threads Th1 and Th2 by the data size for transmission and reception. For example, in the case of the configuration of an n-dimensional hypercube, the target data determining circuit 205 divides the threads into 2^(n) in the first communication in Doubling. In the second or subsequent communications, the target data determining circuit 205 divides the threads Th1 and Th2 by a number ½ times that of a previous communication as the number of communications increases. For example, in the case of the second communication, the target data determining circuit 205 divides the threads into 2^(n-1).
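The number of target determination groups in each communication follows directly from the description: it doubles per communication in Halving and halves per communication in Doubling. A small sketch, with a hypothetical function name and with communications counted from one within each phase:

```python
def group_count(n: int, communication: int, phase: str) -> int:
    """Number of target determination groups per thread in an n-dimensional
    hypercube for the given communication (1-based) of the given phase."""
    if not 1 <= communication <= n:
        raise ValueError("communication must be between 1 and n")
    if phase == "halving":
        return 2 ** communication            # 2, 4, ..., 2**n
    if phase == "doubling":
        return 2 ** (n + 1 - communication)  # 2**n, 2**(n-1), ..., 2
    raise ValueError("phase must be 'halving' or 'doubling'")
```

Under this counting, the first communication of Doubling uses the same division as the final communication of Halving, which is consistent with Doubling tracing the opposite procedure.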

Next, as in the case of Halving, the target data determining circuit 205 assigns transmission and reception target indexes to the generated target determination groups.

In an (n+1−p)th communication, for the thread Th1, the target data determining circuit 205 determines a transmission target area and a reception target area as follows. The target data determining circuit 205 sets, as target determination groups for transmission and reception targets, target determination groups in which the sequence obtained by reversing the arrangement order of the value from the first bit to the (p−1)th bit of the process ID and the sequence obtained by arranging the value from the second bit to the pth bit of the transmission and reception target index coincide with each other. Then, from the target determination groups for transmission and reception targets, the target data determining circuit 205 selects, as the reception target area, a target determination group in which the value obtained by inverting the pth bit of the process ID and the value of the first bit of the transmission and reception target index are the same value. In addition, the target data determining circuit 205 sets, as the transmission target area, a target determination group in which the value of the pth bit of the process ID and the value of the first bit of the transmission and reception target index are the same value.

For example, set as the transmission target area is a target determination group in which the sequence obtained by reversing the arrangement order of the value from the first bit to the pth bit of the process ID and the sequence obtained by arranging the value from the first bit to the pth bit of the transmission and reception target index coincide with each other. In addition, set as the reception target area is a target determination group in which the sequence obtained by inverting the least significant bit in the sequence obtained by reversing the arrangement order of the value from the first bit to the pth bit of the process ID and the sequence obtained by arranging the value from the pth bit to the first bit of the transmission and reception target index coincide with each other.

Further, in the case of the configuration of an n-dimensional hypercube, the number of threads is n. The target data determining circuit 205 determines transmission target data and reception target data for each of the n threads. Accordingly, consideration will be given to an (n+1−p)th communication in a case of a thread Thq (1≤q≤n). The target data determining circuit 205 checks the position of a bit having a value of one in the cyclic ID of the thread Thq, and sets the number of the bit having the value of one as a reference for rearrangement. In this case, the target data determining circuit 205 checks that the value of a qth bit is one, and the target data determining circuit 205 sets q as the reference for rearrangement.

In this case, the rearrangement sequence is a sequence of p bits in a state in which the sequence of a value from the first bit of a process ID to a (q−1)th bit thereof immediately below the reference for rearrangement is moved next to a pth bit, for example, moved to an opposite side from a (p−1)th bit, and the numbers of the bits are renumbered 1 to p. Then, set as the reception target area is a target determination group in which the sequence obtained by reversing the arrangement order of the value from the first bit to the pth bit of the rearrangement sequence and the sequence obtained by arranging the value from the first bit to the pth bit of the transmission and reception target index coincide with each other. In addition, set as the transmission target area is a target determination group in which the sequence obtained by inverting the least significant bit in the sequence obtained by reversing the arrangement order of the value from the first bit to the pth bit of the rearrangement sequence and the sequence obtained by arranging the value from the first bit to the pth bit of the transmission and reception target index coincide with each other.

Returning to FIG. 4B, description will be continued. The target data determining circuit 205 outputs information related to the transmission target data in the threads Th1 and Th2 to the data transmitting circuit 207. In addition, the target data determining circuit 205 outputs information related to the reception target data in the threads Th1 and Th2 to the data receiving circuit 208. The target data determining circuit 205 corresponds to an example of a “data selecting circuit.”

The destination determining circuit 206 receives, from the centralized control circuit 203, input of the information related to the threads Th1 and Th2, the cyclic IDs assigned to the threads Th1 and Th2, and the process ID of the own arithmetic processing circuit. Then, the destination determining circuit 206 determines an arithmetic processing circuit 20 as a transmission destination by performing the following processing for each of the threads. The following description will be made of the processing performed by the destination determining circuit 206 for each thread. FIG. 12 is a diagram of assistance in explaining destination determination processing by a destination determining circuit.

In the case of the first communication after the performance of Halving is specified, the destination determining circuit 206 obtains an exclusive disjunction of the process ID and the cyclic ID (CyclicID) assigned to the thread. Then, the destination determining circuit 206 sets, as a transmission destination of data in the thread, an arithmetic processing circuit 20 having the obtained exclusive disjunction as a transmission destination process ID.

For example, in the case of the first communication, when the process ID is 00 as illustrated in FIG. 12, the destination determining circuit 206 obtains an exclusive disjunction of “00,” which is the process ID, and “01,” which is the cyclic ID assigned to the thread Th1, and the destination determining circuit 206 thereby obtains “01” as a value. Then, the destination determining circuit 206 determines the arithmetic processing circuit 20 having the process ID of 01 as a transmission destination of data of the thread Th1. In addition, the destination determining circuit 206 obtains an exclusive disjunction of “00,” which is the process ID, and “10,” which is the cyclic ID assigned to the thread Th2, and the destination determining circuit 206 thereby obtains “10” as a value. Then, the destination determining circuit 206 determines the arithmetic processing circuit 20 having the process ID of 10 as a transmission destination of data of the thread Th2.

In the case of the second communication, the destination determining circuit 206 shifts the position of 1 of the cyclic ID assigned to each thread to the left by one. The processing of shifting the position of 1 of the cyclic ID to the left by one corresponds to an example of processing of “shifting by one.” For example, the destination determining circuit 206 shifts the position of 1 of the cyclic ID assigned to each thread to a higher position by one bit. At this time, in a case of a cyclic ID whose highest-order bit has a value of one, the destination determining circuit 206 sets the value of a first bit of the cyclic ID to one. For example, the destination determining circuit 206 changes the cyclic ID of the thread Th1 from “01” to “10.” In addition, the destination determining circuit 206 changes the cyclic ID of the thread Th2 from “10” to “01.”

Next, the destination determining circuit 206 obtains an exclusive disjunction of the process ID and the cyclic ID of the thread after the change. Then, the destination determining circuit 206 sets, as a transmission destination of data in the thread, an arithmetic processing circuit 20 having the obtained exclusive disjunction as a transmission destination process ID.

For example, in the case of the second communication, when the process ID is 00, the destination determining circuit 206 obtains an exclusive disjunction of “00,” which is the process ID, and “10,” which is the cyclic ID of the thread Th1 after the change, and the destination determining circuit 206 thereby obtains “10” as a value. Then, the destination determining circuit 206 determines the arithmetic processing circuit 20 having the process ID of 10 as a transmission destination of data of the thread Th1. In addition, the destination determining circuit 206 obtains an exclusive disjunction of “00,” which is the process ID, and “01,” which is the cyclic ID of the thread Th2 after the change, and the destination determining circuit 206 thereby obtains “01” as a value. Then, the destination determining circuit 206 determines the arithmetic processing circuit 20 having the process ID of 01 as a transmission destination of data of the thread Th2.
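The destination determination in Halving, namely an exclusive disjunction followed by a left shift of the cyclic ID with wrap-around, can be sketched as follows. The function names are hypothetical, and process IDs and cyclic IDs are assumed to be integers of n bits.

```python
def rotate_left(cyclic_id: int, n: int) -> int:
    """Shift the cyclic ID left by one bit within n bits. A one in the
    highest-order bit wraps around to the first (lowest) bit."""
    mask = (1 << n) - 1
    return ((cyclic_id << 1) | (cyclic_id >> (n - 1))) & mask


def halving_destinations(process_id: int, cyclic_id: int, n: int) -> list:
    """Return the transmission destination process ID of one thread for
    each of the n Halving communications, in order."""
    destinations = []
    for _ in range(n):
        # Destination = exclusive disjunction of process ID and cyclic ID.
        destinations.append(process_id ^ cyclic_id)
        cyclic_id = rotate_left(cyclic_id, n)
    return destinations
```

For the process ID 00 in the two-dimensional hypercube, this sketch yields the destinations 01 and then 10 for the thread Th1, and 10 and then 01 for the thread Th2, in agreement with the values given above.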

FIG. 13 is a diagram of assistance in explaining destination transitions in a two-dimensional hypercube. Respective circles in FIG. 13 represent the arithmetic processing circuits 20, and numbers provided to the respective circles represent the process IDs of the respective arithmetic processing circuits 20. In the first communication, as indicated by a solid line arrow 401, the arithmetic processing circuit 20 having the process ID of 00 sets the arithmetic processing circuit 20 having the process ID of 01 as a transmission destination of data of the thread Th1. In addition, as indicated by a broken line arrow 402, the arithmetic processing circuit 20 having the process ID of 00 sets the arithmetic processing circuit 20 having the process ID of 10 as a transmission destination of data of the thread Th2. Next, in the second communication, as indicated by a solid line arrow 403, the arithmetic processing circuit 20 having the process ID of 00 sets the arithmetic processing circuit 20 having the process ID of 10 as a transmission destination of data of the thread Th1. In addition, as indicated by a broken line arrow 404, the arithmetic processing circuit 20 having the process ID of 00 sets the arithmetic processing circuit 20 having the process ID of 01 as a transmission destination of data of the thread Th2.

Thus, the arithmetic processing circuit 20 sets different arithmetic processing circuits 20 as transmission destinations of data for the respective threads in one communication. The arithmetic processing circuit 20 may therefore transmit the data to all of the adjacent arithmetic processing circuits 20 in one communication. It is thus possible to suppress occurrence of an unused path, and to make full use of the available bandwidth.
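The property stated above, that the threads of one arithmetic processing circuit address all adjacent circuits in every communication, can be checked with a short sketch. It assumes, as in the two-dimensional example, that the thread Thq starts with a cyclic ID in which only the qth bit is one. The function name is hypothetical.

```python
def destinations_per_communication(process_id: int, n: int) -> list:
    """For each of the n Halving communications, list the destination
    process IDs of threads Th1..Thn of one arithmetic processing circuit."""
    mask = (1 << n) - 1
    cyclic_ids = [1 << q for q in range(n)]  # Th1 -> bit 1, Th2 -> bit 2, ...
    schedule = []
    for _ in range(n):
        schedule.append([process_id ^ c for c in cyclic_ids])
        # Shift every cyclic ID left by one with wrap-around.
        cyclic_ids = [((c << 1) | (c >> (n - 1))) & mask for c in cyclic_ids]
    return schedule
```

Because each cyclic ID has exactly one bit set, every destination differs from the process ID in exactly one bit and is therefore an adjacent node of the hypercube. Because the cyclic IDs of different threads never coincide in the same communication, the n destinations are all different.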

In the case of the first communication after the performance of Doubling is specified, on the other hand, the destination determining circuit 206 obtains the cyclic ID used in the final communication of Halving. Next, the destination determining circuit 206 obtains an exclusive disjunction of the process ID and the cyclic ID assigned to the thread. Then, the destination determining circuit 206 sets, as a transmission destination of data in the thread, an arithmetic processing circuit 20 having the obtained exclusive disjunction as a transmission destination process ID.

Thereafter, the destination determining circuit 206 shifts the position of 1 of the cyclic ID used in each thread at a previous communication to the right by one. For example, the destination determining circuit 206 shifts the position of 1 of the cyclic ID used in each thread in the previous communication to a lower position by one bit. Then, the destination determining circuit 206 obtains an exclusive disjunction of the process ID and the cyclic ID of the thread after the change. Then, the destination determining circuit 206 sets, as a transmission destination of data in the thread, an arithmetic processing circuit 20 having the obtained exclusive disjunction as a transmission destination process ID.
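The destination determination in Doubling can be sketched in the same manner as in Halving, with the shift direction reversed. The wrap-around of a one in the lowest bit to the highest-order bit is an assumption made here by symmetry with the left shift in Halving. The function names are hypothetical.

```python
def rotate_right(cyclic_id: int, n: int) -> int:
    """Shift the cyclic ID right by one bit within n bits. A one in the
    first (lowest) bit is assumed to wrap to the highest-order bit."""
    mask = (1 << n) - 1
    return ((cyclic_id >> 1) | (cyclic_id << (n - 1))) & mask


def doubling_destinations(process_id: int, final_halving_cyclic_id: int,
                          n: int) -> list:
    """Return the destination process ID of one thread for each of the n
    Doubling communications, starting from the cyclic ID that the thread
    used in the final Halving communication."""
    cyclic_id = final_halving_cyclic_id
    destinations = []
    for _ in range(n):
        destinations.append(process_id ^ cyclic_id)
        cyclic_id = rotate_right(cyclic_id, n)
    return destinations
```

In the two-dimensional example, the thread Th1 ends Halving with the cyclic ID “10,” so for the process ID 00 this sketch gives the destinations 10 and then 01, which is the Halving order reversed.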

This cyclic ID corresponds to an example of “cyclic number information.” The processing of shifting the position of 1 of the cyclic ID to the right by one corresponds to an example of processing of “shifting the cyclic number information by one.” In addition, the process ID as a transmission destination obtained by an exclusive disjunction by the destination determining circuit 206 corresponds to an example of destination number information obtained by exclusive disjunction operation on identification number information assigned to each arithmetic processing circuit and cyclic number information assigned to each group. For example, as illustrated in FIG. 12, the destination determining circuit 206 performs exclusive disjunction operation on the cyclic IDs as cyclic number information and the process ID as identification number information, and calculates the process IDs as destination number information by the exclusive disjunction operation. Then, the destination determining circuit 206 sets the calculated process IDs as the process IDs of the arithmetic processing circuits 20 as transmission destinations in each communication illustrated in FIG. 13.

Next, the destination determining circuit 206 shifts the cyclic ID to the left and selects an arithmetic processing circuit 20 as a transmission destination of data, and may thereby set, as a transmission destination of data for each thread, an arithmetic processing circuit 20 having data not coinciding with the data retained by each thread. The arithmetic processing circuit 20 having data not coinciding with the data retained by each thread corresponds to an example of an “arithmetic processing circuit as a complementing counterpart.”

Further, the destination determination processing by the destination determining circuit 206 described here corresponds to an example of processing of “selecting different arithmetic processing circuits as transmission destinations from among adjacent arithmetic processing circuits based on the destination number information obtained by the exclusive disjunction operation on the identification number information and the cyclic number information.”
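As a minimal illustration of the destination determination described above, the following Python sketch computes, for each communication of Halving, the transmission destination process ID of each thread as the exclusive disjunction of the process ID and the cyclic ID. The function name halving_destinations is hypothetical, and the sketch assumes that thread k initially holds a cyclic ID whose single 1 is at bit k; neither is taken from the drawings.

```python
def halving_destinations(process_id, n_bits):
    """Return, per communication, the destination process ID of each thread.

    Assumption: thread k starts with a cyclic ID whose only 1 is at bit k,
    and every cyclic ID is cyclically shifted left by one bit per communication.
    """
    schedule = []
    for step in range(n_bits):
        # After `step` cyclic left shifts, the 1 of thread k sits at bit (k + step) mod n_bits.
        cyclic_ids = [1 << ((k + step) % n_bits) for k in range(n_bits)]
        # Destination process ID = process ID XOR cyclic ID.
        schedule.append([process_id ^ c for c in cyclic_ids])
    return schedule


# Process 00 in a two-dimensional hypercube with two threads.
print(halving_destinations(0b00, 2))  # [[1, 2], [2, 1]]
```

Because the cyclic IDs of different threads never have their 1 at the same bit in the same communication, the threads of one arithmetic processing circuit always select mutually different destinations. In Doubling, the same destinations would be visited in reverse order.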

Returning to FIG. 4B, description will be continued. The destination determining circuit 206 outputs information related to the data transmission destinations of the threads Th1 and Th2 to the data transmitting circuit 207. The destination determining circuit 206 corresponds to an example of a “transmission destination selecting circuit.”

The data transmitting circuit 207 receives input of information related to the transmission target data in the threads Th1 and Th2 from the target data determining circuit 205. In addition, the data transmitting circuit 207 receives information related to the data transmission destinations of the threads Th1 and Th2 from the destination determining circuit 206.

The data transmitting circuit 207 requests the memory control circuits 23 to obtain data strings stored in blocks specified by the obtained information related to the transmission target data in the threads Th1 and Th2. Thereafter, the data transmitting circuit 207 receives, from the memory control circuits 23, input of the data strings stored in the blocks specified by the obtained information related to the transmission target data in the threads Th1 and Th2. The obtained data is transmission data in the threads Th1 and Th2. Then, the data transmitting circuit 207 transmits the transmission data in the thread Th1 to the data transmission destination of the thread Th1. In addition, the data transmitting circuit 207 transmits the transmission data in the thread Th2 to the data transmission destination of the thread Th2.

For example, in the first communication during the performance of Halving, when the process ID of the own arithmetic processing circuit is 00, the data transmitting circuit 207 transmits the data strings of the blocks included in the area 304 in FIG. 8 to the arithmetic processing circuit 20 having the process ID of 01. In addition, the data transmitting circuit 207 transmits the data strings of the blocks included in the area 334 in FIG. 10 to the arithmetic processing circuit 20 having the process ID of 10. In addition, in the second communication, when the process ID of the own arithmetic processing circuit is 00, the data transmitting circuit 207 transmits the data string of the block included in the area 308 in FIG. 8 to the arithmetic processing circuit 20 having the process ID of 10. In addition, the data transmitting circuit 207 transmits the data string of the block included in the area 338 in FIG. 10 to the arithmetic processing circuit 20 having the process ID of 01. The data transmitting circuit 207 corresponds to an example of a “transmitting circuit.”

The data receiving circuit 208 receives input of the information related to the reception target data from the target data determining circuit 205. Then, the data receiving circuit 208 receives input of data strings transmitted from other arithmetic processing circuits 20. Here, the input data strings are arranged in respective blocks. The data receiving circuit 208 may determine to which block of the reception target data each data string corresponds. Thereafter, the data receiving circuit 208 requests the memory control circuits 23 to add the reception target data corresponding to each of the obtained data strings to the respective data strings. The data receiving circuit 208 corresponds to an example of a “receiving circuit.”

The outside coupling processing performing circuit 209 receives a request for the processing of Allreduce in the external topology from the centralized control circuit 203. Then, the outside coupling processing performing circuit 209 performs Allreduce processing with the arithmetic processing circuits 20 forming a hypercube other than the hypercube including the own arithmetic processing circuit. A method of the processing of Allreduce in the external topology is not particularly limited. The outside coupling processing performing circuit 209 may perform the processing using an existing procedure for the processing of Allreduce or the like. In addition, the outside coupling processing performing circuit 209 may request the sub-net manager 6 to make coupling in the external topology, and perform path management for the processing of Allreduce in the external topology. When the outside coupling processing performing circuit 209 then completes the Allreduce processing with the external topology, the outside coupling processing performing circuit 209 notifies the centralized control circuit 203 of completion of the processing of Allreduce in the external topology.

The synchronization processing circuit 210 receives a notification of performance of synchronization processing in the internal topology from the centralized control circuit 203 after completion of performance of Doubling. Then, the synchronization processing circuit 210 checks whether or not the same data is shared between the arithmetic processing circuits 20 included in the hypercube. For example, the synchronization processing circuit 210 obtains the data strings stored in the memories 24 of the other arithmetic processing circuits 20. Further, the synchronization processing circuit 210 requests the memory control circuits 23 to obtain data strings as a result of operation of the Allreduce processing, the data strings being possessed by the own arithmetic processing circuit, and obtains the data strings from the memory control circuits 23. Then, the synchronization processing circuit 210 checks whether or not the data strings stored in the memories 24 of the respective arithmetic processing circuits 20 are the same. When the same data is shared, the memory control circuits 23 notify the centralized control circuit 203 of completion of synchronization in the internal topology.

The synchronization processing circuit 210 thereafter receives a notification of performance of synchronization processing in the external topology from the centralized control circuit 203. The synchronization processing circuit 210 checks whether or not the same data is shared between the arithmetic processing circuits 20 other than the arithmetic processing circuits 20 included in the hypercube performing the processing of Allreduce. For example, the synchronization processing circuit 210 obtains a data string stored in the memory 24 of one of the arithmetic processing circuits 20 set as targets of the processing of Allreduce on the other system board 1. Then, the synchronization processing circuit 210 checks whether or not the obtained data string and the data string stored in the memories 24 of the own arithmetic processing circuit are the same. When the same data is shared, the memory control circuits 23 notify the centralized control circuit 203 of completion of synchronization in the external topology.

The memory control circuits 23 receive a request to obtain data strings determined as transmission target data from the data transmitting circuit 207. Then, the memory control circuits 23 obtain the specified data strings from the memories 24. Thereafter, the memory control circuits 23 output the obtained data strings to the data transmitting circuit 207.

In addition, the memory control circuits 23 receive input of data strings as received data from the data receiving circuit 208. In addition, the memory control circuits 23 receive, from the data receiving circuit 208, a request to add reception target data corresponding to the respective data strings to the respective data strings. Then, the memory control circuits 23 obtain the reception target data corresponding to the respective data strings from the memories 24. Then, the memory control circuits 23 output the obtained reception target data and the data strings as received data to the parallel arithmetic circuits 21, and request the addition. Thereafter, the memory control circuits 23 obtain results of the addition of the respective data strings as received data to the corresponding reception target data from the parallel arithmetic circuits 21. Then, the memory control circuits 23 store the results of the addition in the memories 24 such that the results of the addition are stored in blocks storing the respective reception target data. Then, when the operation on the data obtained in one communication is ended, and the storing of the operation results into the memories 24 is completed, the memory control circuits 23 notify completion of the communication to the centralized control circuit 203.
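The addition of received data to the reception target data may be pictured with the following Python sketch. It is only an illustration: the memory 24 is modeled as a flat list, and the names receive_and_add, memory, and block_start are hypothetical.

```python
def receive_and_add(memory, block_start, received):
    """Add a received data string element-wise to the reception target data.

    `memory` stands in for a memory 24, and `block_start` is the index of the
    first element of the block holding the reception target data (assumed names).
    The result is stored back in the block that held the reception target data.
    """
    for k, value in enumerate(received):
        memory[block_start + k] += value
    return memory


memory = [1, 2, 3, 4]
print(receive_and_add(memory, 2, [10, 20]))  # [1, 2, 13, 24]
```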

Further, the memory control circuits 23 receive, from the synchronization processing circuit 210, a request to obtain data strings as operation results of the Allreduce processing, the data strings being possessed by the own arithmetic processing circuit. Then, the memory control circuits 23 obtain data strings whose Doubling processing is completed from the memories 24. Then, the memory control circuits 23 output the obtained data strings to the synchronization processing circuit 210.

An entire flow of the processing of Allreduce will next be described with reference to FIGS. 14A to 14C. FIGS. 14A to 14C illustrate a flowchart of the entire processing of Allreduce. The following description will be made in a case where arithmetic processing circuits 20 are included in an n-dimensional hypercube.

The process ID allocating circuit 201 obtains a connection table from the network configuration managing circuit 28. Then, the process ID allocating circuit 201 obtains information related to the hypercube including the arithmetic processing circuits 20 from the connection table (step S1).

Next, the process ID allocating circuit 201 assigns a unique process ID of n bits to each of the 2^(n) arithmetic processing circuits 20 included in the hypercube (step S2). Then, the process ID allocating circuit 201 outputs the process IDs assigned to the respective arithmetic processing circuits 20 to the centralized control circuit 203.

The centralized control circuit 203 of each of the arithmetic processing circuits 20 obtains the process IDs assigned to the respective arithmetic processing circuits 20 from the process ID allocating circuit 201. Then, the centralized control circuit 203 starts parallel Allreduce processing for 2^(n) processes (step S3). The centralized control circuit 203 outputs the process IDs assigned to the respective arithmetic processing circuits 20 to the thread generating circuit 202.

The thread generating circuit 202 receives input of the process IDs assigned to the respective arithmetic processing circuits 20 from the centralized control circuit 203. Then, the thread generating circuit 202 obtains the dimensions of the hypercube including the own arithmetic processing circuit. The following description will be made in a case of an n-dimensional hypercube. The thread generating circuit 202 obtains information related to the process assigned to the own arithmetic processing circuit from the job managing circuit 27. Then, the thread generating circuit 202 generates n threads Th1 to Thn by separating a data string as a result of operation by the process assigned to the own arithmetic processing circuit into n groups, and assigns identification number information to each of the generated threads (step S4). Thereafter, the thread generating circuit 202 outputs, to the centralized control circuit 203, information related to the threads which information includes the identification number information and information related to the data strings included in the respective threads Th1 to Thn.
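The separation of the data string into n groups in step S4 may be sketched as follows. The sketch assumes, for simplicity, that the length of the data string is a multiple of n and that the groups are contiguous; the function name make_threads is hypothetical.

```python
def make_threads(data, n):
    """Split a data string into n contiguous groups, one per thread.

    Assumption: len(data) is a multiple of n. The list index of each group
    plays the role of the identification number assigned to the thread.
    """
    size = len(data) // n
    return [data[k * size:(k + 1) * size] for k in range(n)]


print(make_threads(list(range(8)), 2))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```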

The centralized control circuit 203 obtains the information related to the threads from the thread generating circuit 202. Then, the centralized control circuit 203 performs Halving for each of the n threads Th1 to Thn (steps S5 to S7). Details of Halving will be described later.

After completion of Halving, the centralized control circuit 203 performs the processing of Allreduce in the external topology for each of the n threads Th1 to Thn (steps S8 to S10).

After completion of the processing of Allreduce in the external topology, the centralized control circuit 203 makes the outside coupling processing performing circuit 209 perform Doubling for each of the n threads Th1 to Thn (steps S11 to S13). Details of Doubling will be described later.

After completion of the processing of Doubling, the centralized control circuit 203 makes the synchronization processing circuit 210 perform synchronization processing between the n threads Th1 to Thn (step S14).

Thereafter, the centralized control circuit 203 makes the synchronization processing circuit 210 perform synchronization processing between the 2^(n) processes including the process executed by the own arithmetic processing circuit (step S15). The centralized control circuit 203 thereby completes the parallel Allreduce processing.

A flow of processing of Halving will next be described with reference to FIG. 15. FIG. 15 is a flowchart of Halving. The flowchart of FIG. 15 corresponds to an example of the processing performed in steps S5 to S7 in FIG. 14B.

The thread generating circuit 202 assigns a cyclic ID of n bits to each thread, and thereby initializes the cyclic IDs (step S101). The centralized control circuit 203 obtains the cyclic IDs of the respective threads from the thread generating circuit 202.

Next, the centralized control circuit 203 sets i=0 (step S102).

Next, the centralized control circuit 203 outputs, to the transmission and reception data size calculating circuit 204, the process ID of the own arithmetic processing circuit, the information related to the threads, the cyclic IDs assigned to the respective threads, and the data string of each block. The transmission and reception data size calculating circuit 204 sets the blocks of 2^(n-1) data strings as a data size for transmission and reception in a first communication. Thereafter, the transmission and reception data size calculating circuit 204 calculates ½ of the data size in the previous communication as a data size for transmission and reception in a present communication (step S103). Then, the transmission and reception data size calculating circuit 204 outputs the data size for transmission and reception to the target data determining circuit 205. In addition, the transmission and reception data size calculating circuit 204 outputs the information related to each thread, the cyclic ID assigned to each thread, and the data string of each block to the target data determining circuit 205.
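The sequence of data sizes obtained in step S103 may be sketched as follows, counted in blocks of data strings. The function name halving_sizes is hypothetical, and the sketch assumes one size per communication over n communications.

```python
def halving_sizes(n):
    """Data size for transmission and reception in each communication of Halving.

    The first communication uses 2**(n-1) blocks, and each later communication
    uses half the size of the previous one (assumed: n communications in total).
    """
    sizes = [2 ** (n - 1)]
    while len(sizes) < n:
        sizes.append(sizes[-1] // 2)
    return sizes


print(halving_sizes(3))  # [4, 2, 1]
```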

The target data determining circuit 205 receives, from the transmission and reception data size calculating circuit 204, input of the data size for transmission and reception, the information related to the threads Th1 and Th2, the cyclic IDs assigned to the threads Th1 and Th2, and the data string of each block. Next, the target data determining circuit 205 generates target determination groups by dividing each thread by the data size for performing transmission and reception. Then, the target data determining circuit 205 assigns transmission and reception target indexes to the target determination groups, and determines storage areas of reception target data using the process ID and the transmission and reception target indexes (step S104).

Next, the target data determining circuit 205 determines storage areas of transmission target data using the process ID and the transmission and reception target indexes (step S105). Thereafter, the target data determining circuit 205 outputs information related to the storage areas of the transmission target data to the data transmitting circuit 207. In addition, the target data determining circuit 205 outputs information related to the storage areas of the reception target data to the data receiving circuit 208.

In addition, the destination determining circuit 206 receives, from the centralized control circuit 203, input of the information related to each of the threads, the cyclic IDs assigned to the threads, and the process ID of the own arithmetic processing circuit. Then, the destination determining circuit 206 obtains, for each thread, an exclusive disjunction of the process ID and the cyclic ID, and obtains the process ID of an arithmetic processing circuit 20 as a transmission destination for each thread (step S106). Thereafter, the destination determining circuit 206 outputs the process ID of the transmission destination arithmetic processing circuit 20 for each thread to the data transmitting circuit 207.

The data transmitting circuit 207 receives input of the information related to the storage areas of the transmission target data from the target data determining circuit 205. In addition, the data transmitting circuit 207 receives input of the process ID of the transmission destination arithmetic processing circuit 20 for each thread from the destination determining circuit 206. In addition, the data receiving circuit 208 receives input of the information related to the storage areas of the reception target data from the target data determining circuit 205. Then, the data transmitting circuit 207 and the data receiving circuit 208 perform data transmission and reception (step S107). At this time, the data receiving circuit 208 outputs received data from other arithmetic processing circuits 20 to the memory control circuits 23 together with the information related to the corresponding reception target data. The memory control circuits 23 make the parallel arithmetic circuits 21 perform operation on the received data and the reception target data, and store results of the operation in locations as the storage areas of the reception target data in the memories 24. Thereafter, the memory control circuits 23 notify the centralized control circuit 203 of completion of the data reception.

The centralized control circuit 203 receives the notification of the completion of the data reception from the memory control circuits 23. Then, the centralized control circuit 203 cyclically shifts the value of the cyclic ID assigned to each thread to the left (step S108).

Thereafter, the centralized control circuit 203 determines whether or not i=n (step S109). When i=n does not hold (step S109: negative), the centralized control circuit 203 increments i by one (step S110). Thereafter, the centralized control circuit 203 returns to step S103.

When i=n (step S109: affirmative), on the other hand, the centralized control circuit 203 ends the processing of Halving.

A flow of processing of Doubling will next be described with reference to FIG. 16. FIG. 16 is a flowchart of Doubling. The flowchart of FIG. 16 corresponds to an example of the processing performed in steps S11 to S13 in FIG. 14B.

The centralized control circuit 203 obtains the transmission and reception data size at a time of an end of Halving and the cyclic ID of each thread (step S201).

Next, the centralized control circuit 203 sets i=0 (step S202).

Next, the centralized control circuit 203 cyclically shifts the value of the cyclic ID of each thread to the right (step S203).

Thereafter, the centralized control circuit 203 outputs, to the transmission and reception data size calculating circuit 204, the process ID of the own arithmetic processing circuit, the information related to the threads, the cyclic IDs assigned to the respective threads, and the data string of each block. The transmission and reception data size calculating circuit 204 sets the block of one data string as a data size for transmission and reception in a first communication. Thereafter, the transmission and reception data size calculating circuit 204 calculates the data size for transmission and reception in a present communication by doubling the data size in the previous communication (step S204). Then, the transmission and reception data size calculating circuit 204 outputs the data size for transmission and reception to the target data determining circuit 205. In addition, the transmission and reception data size calculating circuit 204 outputs, to the target data determining circuit 205, the information related to each thread, the cyclic ID assigned to each thread, and the data string of each block.

The target data determining circuit 205 receives, from the transmission and reception data size calculating circuit 204, input of the data size for transmission and reception, the information related to the threads Th1 and Th2, the cyclic IDs assigned to the threads Th1 and Th2, and the data string of each block. Next, the target data determining circuit 205 generates target determination groups by dividing each thread by the data size for performing transmission and reception. Then, the target data determining circuit 205 assigns transmission and reception target indexes to the target determination groups, and determines storage areas of reception target data using the process ID and the transmission and reception target indexes (step S205).

Next, the target data determining circuit 205 determines storage areas of transmission target data using the process ID and the transmission and reception target indexes (step S206). Thereafter, the target data determining circuit 205 outputs information related to the storage areas of the transmission target data to the data transmitting circuit 207. In addition, the target data determining circuit 205 outputs information related to the storage areas of the reception target data to the data receiving circuit 208.

In addition, the destination determining circuit 206 receives, from the centralized control circuit 203, input of the information related to the threads, the cyclic IDs assigned to the threads, and the process ID of the own arithmetic processing circuit. Then, the destination determining circuit 206 obtains, for each thread, an exclusive disjunction of the process ID and the cyclic ID, and obtains the process ID of an arithmetic processing circuit 20 as a transmission destination for each thread (step S207). Thereafter, the destination determining circuit 206 outputs the process ID of the transmission destination arithmetic processing circuit 20 for each thread to the data transmitting circuit 207.

The data transmitting circuit 207 receives input of the information related to the storage areas of the transmission target data from the target data determining circuit 205. In addition, the data transmitting circuit 207 receives input of the process ID of the transmission destination arithmetic processing circuit 20 for each thread from the destination determining circuit 206. In addition, the data receiving circuit 208 receives input of the information related to the storage areas of the reception target data from the target data determining circuit 205. Then, the data transmitting circuit 207 and the data receiving circuit 208 perform data transmission and reception (step S208). At this time, the data receiving circuit 208 outputs received data from other arithmetic processing circuits 20 to the memory control circuits 23 together with the information related to the corresponding reception target data. The memory control circuits 23 make the parallel arithmetic circuits 21 perform operation on the received data and the reception target data, and store results of the operation in locations as the storage areas of the reception target data in the memories 24. Thereafter, the memory control circuits 23 notify the centralized control circuit 203 of completion of the data reception.

The centralized control circuit 203 receives the notification of the completion of the data reception from the memory control circuits 23. Then, the centralized control circuit 203 determines whether or not i=n (step S209). When i=n does not hold (step S209: negative), the centralized control circuit 203 increments i by one (step S210). Thereafter, the centralized control circuit 203 returns to step S203.

When i=n (step S209: affirmative), on the other hand, the centralized control circuit 203 ends the processing of Doubling.
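To make the interplay of Halving and Doubling concrete, the following Python sketch simulates both phases for 2^n processes with n threads each. It is a simplified model under several assumptions that are not taken from the flowcharts: the processing of Allreduce in the external topology is omitted, each communication is modeled by reading a snapshot of the peer's memory, the area exchanged in the i-th communication of Halving is taken to be the thread length divided by 2^(i+1), exactly n communications are performed in each phase, and the name allreduce_hypercube is hypothetical.

```python
def allreduce_hypercube(inputs, n):
    """Simulate Halving followed by Doubling among 2**n processes (sketch).

    inputs[p] is the data string of process p; its length must be n times a
    multiple of 2**n so that every thread can be halved n times.
    """
    num_procs = 1 << n
    mask = num_procs - 1                      # n-bit mask for cyclic shifts
    thread_len = len(inputs[0]) // n
    assert len(inputs) == num_procs and thread_len % num_procs == 0
    data = [list(v) for v in inputs]
    cyclic = [1 << t for t in range(n)]       # cyclic ID of each thread
    hist = [[0] * n for _ in range(num_procs)]  # progress state per process/thread

    # Halving: keep one half of the current area and add the peer's copy of it.
    for i in range(n):
        buf = thread_len >> (i + 1)
        snap = [list(v) for v in data]
        for p in range(num_procs):
            for t in range(n):
                hist[p][t] = (hist[p][t] << 1) | (1 if cyclic[t] & p else 0)
                recv = t * thread_len + hist[p][t] * buf
                peer = p ^ cyclic[t]          # destination = process ID XOR cyclic ID
                for k in range(recv, recv + buf):
                    data[p][k] = snap[p][k] + snap[peer][k]
        cyclic = [((c << 1) | (c >> (n - 1))) & mask for c in cyclic]

    # Doubling: visit the peers in reverse order and copy their finished areas.
    for i in range(n):
        cyclic = [((c >> 1) | (c << (n - 1))) & mask for c in cyclic]
        buf = thread_len >> (n - i)
        snap = [list(v) for v in data]
        for p in range(num_procs):
            for t in range(n):
                recv = t * thread_len + (hist[p][t] ^ 1) * buf
                peer = p ^ cyclic[t]
                data[p][recv:recv + buf] = snap[peer][recv:recv + buf]
                hist[p][t] >>= 1
    return data


inputs = [[10 * p + k for k in range(8)] for p in range(4)]
print(allreduce_hypercube(inputs, 2)[0])  # [60, 64, 68, 72, 76, 80, 84, 88]
```

In this model every process ends with the element-wise sum of all inputs, and in each communication the n threads of a process exchange data with n different peers.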

A program for performing Halving and Doubling will further be described. FIG. 17 is a diagram illustrating an example of pseudocode of a program for performing Halving and Doubling.

CyclicLeftShift (val) in FIG. 17 is a function that cyclically shifts the value of val to the left. In addition, CyclicRightShift (val) is a function that cyclically shifts the value of val to the right. Further, SendRecv (dst, scr, size, peer) is a function that sends data of the amount given by size, starting from the address of scr, to a peer, and overwrites dst with the data sent from the peer. In addition, SendRecv_Add (dst, scr, size, peer) is a function that sends data of the amount given by size, starting from the address of scr, to a peer, and adds the data sent from the peer to dst. In addition, StepHist is a value indicating a progress state of the processing of Halving or Doubling being performed.
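The two shift functions may be sketched in Python as follows. Unlike the single-argument functions named above, the sketch takes the bit width n_bits as an explicit second parameter, which is an assumption made here because Python integers have no fixed width.

```python
def cyclic_left_shift(val, n_bits):
    """Rotate an n_bits-wide value left by one bit (n_bits is an assumed parameter)."""
    mask = (1 << n_bits) - 1
    return ((val << 1) | (val >> (n_bits - 1))) & mask


def cyclic_right_shift(val, n_bits):
    """Rotate an n_bits-wide value right by one bit."""
    mask = (1 << n_bits) - 1
    return ((val >> 1) | (val << (n_bits - 1))) & mask


print(bin(cyclic_left_shift(0b100, 3)))   # 0b1
print(bin(cyclic_right_shift(0b001, 3)))  # 0b100
```

The right shift undoes the left shift, which is why Doubling can retrace the communication partners of Halving in reverse order.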

A flow of processing with regard to Halving of the program illustrated in FIG. 17 will next be described with reference to FIGS. 18A and 18B. FIGS. 18A and 18B illustrate a flowchart of processing performed in a program implementing Halving. FIGS. 18A and 18B correspond to an example of a case where the processing of the flowchart of FIG. 15 is implemented by a program. In the following, description will be made of each piece of processing supposing that the arithmetic circuit possessed by the network control circuit 29 is an operating entity. In addition, in the following, threadID represented by a value of 0 to n−1 is used as a thread identifier.

The arithmetic circuit generates the cyclic ID (CyclicID) to be assigned to each thread by shifting, to the left by one position at a time, the position of the value of one in a cyclic ID of n bits having a single value of one, and thereby initializes CyclicID. In addition, the arithmetic circuit sets StepHist=0. Further, the arithmetic circuit multiplies threadID by a value obtained by dividing the data size (D) of an entire result of process operation by n. Next, the arithmetic circuit adds the head address (MemAddr) of a memory 24 to the multiplication result, thereby calculates an offset of each thread, and thus initializes the offset (step S301).
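The initialization in step S301 may be sketched as follows. The sketch assumes that the thread with threadID k receives the cyclic ID whose single 1 is at bit k and that D is a multiple of n; the function name init_thread_state is hypothetical.

```python
def init_thread_state(thread_id, n, data_size, mem_addr):
    """Return (CyclicID, StepHist, offset) for one thread (sketch of step S301).

    Assumptions: thread k gets the cyclic ID with its single 1 at bit k, and
    data_size (D) is a multiple of n.
    """
    cyclic_id = 1 << thread_id
    step_hist = 0
    offset = mem_addr + thread_id * (data_size // n)
    return cyclic_id, step_hist, offset


# Second thread (threadID 1) of two, D = 16, memory head address 100.
print(init_thread_state(1, 2, 16, 100))  # (2, 0, 108)
```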

Next, the arithmetic circuit sets i=0 (step S302).

Next, the arithmetic circuit adds a bit of interest to the least significant position of StepHist (step S303). For example, the arithmetic circuit shifts the bit string of StepHist to the left by one. In this case, the arithmetic circuit sets a least significant bit to zero. Then, the arithmetic circuit determines whether a same bit as a bit set to one in CyclicID is also one in the process ID (processID). For example, the arithmetic circuit obtains a logic product of the cyclic ID and the process ID. The arithmetic circuit determines whether or not the obtained value is larger than zero. When the obtained value is larger than zero, the arithmetic circuit determines that the bit is also one in the process ID. When the obtained value is zero or smaller, the arithmetic circuit determines that the bit is zero in the process ID. Then, when the bit is also one in the process ID, the arithmetic circuit sets the least significant bit of StepHist to one by obtaining an exclusive disjunction of StepHist and a bit string that has a same number of bits and whose least significant bit is one. When the bit is zero in the process ID, on the other hand, the arithmetic circuit sets the least significant bit of StepHist to zero by obtaining an exclusive disjunction of StepHist and a bit string that has a same number of bits and whose bits are all zero.
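The update of StepHist in step S303 may be sketched as follows; the function name update_step_hist is hypothetical.

```python
def update_step_hist(step_hist, cyclic_id, process_id):
    """Append the bit of interest to the least significant position of StepHist."""
    step_hist <<= 1                    # make room; the new least significant bit is zero
    if (cyclic_id & process_id) > 0:   # the bit selected by CyclicID is one in processID
        step_hist ^= 1                 # set the least significant bit to one
    else:
        step_hist ^= 0                 # leave the least significant bit at zero
    return step_hist


# processID 01: the first communication looks at bit 0, the second at bit 1.
h = update_step_hist(0, 0b01, 0b01)
print(h)                                  # 1
print(update_step_hist(h, 0b10, 0b01))    # 2
```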

Next, the arithmetic circuit calculates a data size for performing transmission and reception (step S304). For example, the arithmetic circuit obtains BufferSize by dividing, by two raised to the power of i, a value obtained by dividing the data size of the entire result of process operation by n. This BufferSize is the data size for performing transmission and reception.

Next, the arithmetic circuit determines the head address of data as a reception target (step S305). For example, the arithmetic circuit calculates RecvAdd as the head address of data as a reception target by adding the offset to a value obtained by multiplying StepHist by BufferSize.

For example, the blocks of each thread are divided into 2×2^(i) areas. Then, the arithmetic circuit assigns a transmission and reception target index of 1+i bits to each divided area of the thread. In addition, at this point in time, StepHist also has a value similarly represented by 1+i bits. Accordingly, the arithmetic circuit sets an area in which the transmission and reception target index coincides with the value of StepHist as the area of data as a reception target. Then, the arithmetic circuit obtains the head address of data as a reception target by multiplying BufferSize by the index.

Next, the arithmetic circuit determines the head address of data as a transmission target (step S306). For example, an area having a transmission and reception target index whose least significant bit is different with respect to the head address of the reception destination is the area of data as a transmission target. Accordingly, the arithmetic circuit obtains an exclusive disjunction of StepHist and a value that has a same number of bits as StepHist and whose least significant bit is one. Then, the arithmetic circuit multiplies the value of the obtained exclusive disjunction by BufferSize, and adds the offset to a result of the multiplication. The arithmetic circuit thereby calculates SendAdd as the head address of data as a transmission target.
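The address calculations of steps S305 and S306 may be sketched together as follows. BufferSize is passed in as a parameter rather than derived, and the function name halving_addresses is hypothetical.

```python
def halving_addresses(step_hist, buffer_size, offset):
    """Return (RecvAdd, SendAdd) for one communication of Halving (sketch).

    The reception target is the area whose index equals StepHist; the
    transmission target is the area whose index differs only in the least
    significant bit.
    """
    recv_add = offset + step_hist * buffer_size
    send_add = offset + (step_hist ^ 1) * buffer_size
    return recv_add, send_add


# StepHist = 10 (binary), areas of 4 elements, thread starting at address 100.
print(halving_addresses(0b10, 4, 100))  # (108, 112)
```

The two areas are always adjacent halves of the area kept in the previous communication, so a process sends exactly the half it does not keep.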

Next, the arithmetic circuit determines the process ID of an arithmetic processing circuit 20 as a transmission destination (step S307). For example, the arithmetic circuit calculates an exclusive disjunction of the cyclic ID and the process ID, and sets a result of the calculation as Peer, which is the process ID of an arithmetic processing circuit 20 as a transmission and reception destination.

Thereafter, the arithmetic circuit performs data transmission and reception and processing of addition of received data (step S308). For example, the arithmetic circuit executes SendRecv_Add (RecvAdd, SendAdd, BufferSize, Peer).

Next, the arithmetic circuit cyclically shifts the value of the cyclic ID to the left by executing CyclicLeftShift (CyclicID) (step S309).

Then, the arithmetic circuit determines whether or not i is smaller than n (step S310). When i is smaller than n (step S310: affirmative), the arithmetic circuit increments the value of i by one (step S311), and returns to step S303.

When i=n (step S310: negative), on the other hand, the arithmetic circuit ends Halving.

A flow of processing with regard to Doubling of the program illustrated in FIG. 17 will next be described with reference to FIGS. 19A and 19B. FIGS. 19A and 19B illustrate a flowchart of processing performed in a program implementing Doubling. FIGS. 19A and 19B correspond to an example of a case where the processing of the flowchart of FIG. 16 is implemented by a program. Also in the following, description will be made of each piece of processing supposing that the arithmetic circuit possessed by the network control circuit 29 is an operating entity.

The arithmetic circuit obtains BufferSize, the cyclic ID, StepHist, and the offset at a time of an end of Halving (step S401).

Next, the arithmetic circuit sets i=0 (step S402).

Next, the arithmetic circuit cyclically shifts the value of the cyclic ID to the right by performing CyclicRightShift (CyclicID) (step S403). For example, the arithmetic circuit exchanges data with communication destinations in reverse order to that of Halving. Halving performs a shift to the left once after performing final data transmission and reception. Thus, here, the arithmetic circuit performs a shift to the right first.

Next, the arithmetic circuit calculates the head address of data as a reception target (step S404). For example, in Doubling, the areas of transmission and reception target data are opposite from those of Halving. Accordingly, the arithmetic circuit obtains an exclusive disjunction of StepHist and a value that has the same number of bits as StepHist and whose least significant bit is one. Then, the arithmetic circuit multiplies the value of the obtained exclusive disjunction by BufferSize, and adds the offset to a result of the multiplication. The arithmetic circuit thereby calculates RecvAdd, which is the head address of data as a reception target.

Next, the arithmetic circuit determines the head address of data as a transmission target (step S405). For example, the arithmetic circuit calculates SendAdd as the head address of data as a transmission target by adding the offset to a value obtained by multiplying StepHist by BufferSize.
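
The address calculations of steps S404 and S405 may be sketched as follows. The function name and the return form are hypothetical. The sketch further assumes that the value whose least significant bit is one has only that bit set (that is, the value 1), and that the offset and BufferSize are expressed in the same unit.

```python
def doubling_addresses(step_hist, buffer_size, offset):
    """Return (RecvAdd, SendAdd) for one Doubling step.

    RecvAdd uses StepHist with its least significant bit inverted
    (cf. step S404); SendAdd uses StepHist as it is (cf. step S405).
    """
    recv_add = (step_hist ^ 1) * buffer_size + offset
    send_add = step_hist * buffer_size + offset
    return recv_add, send_add
```

For example, with StepHist of binary 10, BufferSize of 4, and an offset of 100, the sketch gives RecvAdd of 112 and SendAdd of 108, so that the reception area and the transmission area are neighboring areas of BufferSize each.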

Next, the arithmetic circuit determines the process ID of an arithmetic processing circuit 20 as a transmission destination (step S406). For example, the arithmetic circuit calculates an exclusive disjunction of the cyclic ID and the process ID, and sets a result of the calculation as Peer, which is the process ID of an arithmetic processing circuit 20 as a transmission and reception destination.

Next, the arithmetic circuit calculates a data size for performing transmission and reception at a time of a next communication (step S407). For example, the arithmetic circuit sets twice a data size in a present communication as the data size for performing transmission and reception at the time of the next communication.

Thereafter, the arithmetic circuit performs data transmission and reception and processing of addition of received data (step S408). For example, the arithmetic circuit executes SendRecv (RecvAdd, SendAdd, CyclicID, BufferSize, Peer).

Next, the arithmetic circuit shifts the value of StepHist to the right (step S409). The shift to the right in this case is not a cyclic right shift. The shift to the right causes the value of the first bit of StepHist to disappear. The arithmetic circuit thereby makes the value of StepHist change in reverse order to that of Halving.

Then, the arithmetic circuit determines whether or not i is smaller than n (step S410). When i is smaller than n (step S410: affirmative), the arithmetic processing circuit increments the value of i by one (step S411), and returns to step S403.

When i=n (step S410: negative), on the other hand, the arithmetic processing circuit ends Doubling.
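
The order of communication destinations in Doubling, together with the non-cyclic right shift of StepHist in step S409, may be sketched as follows. The function names are hypothetical, data transfer itself is omitted, and the sketch assumes one exchange per dimension of the hypercube rather than reproducing the loop bounds of FIGS. 19A and 19B.

```python
def cyclic_right_shift(cyclic_id, n_bits):
    """Rotate an n_bits-wide value to the right by one bit (cf. step S403)."""
    mask = (1 << n_bits) - 1
    return ((cyclic_id >> 1) | (cyclic_id << (n_bits - 1))) & mask


def doubling_schedule(process_id, cyclic_id_after_halving, step_hist, n_bits):
    """Return (Peer, StepHist) pairs in the order Doubling would use them.

    Assumption for this sketch: exactly n_bits exchanges are performed.
    """
    schedule = []
    cyclic_id = cyclic_id_after_halving
    for _ in range(n_bits):
        # Undo the left shift that Halving performed after its last exchange.
        cyclic_id = cyclic_right_shift(cyclic_id, n_bits)
        schedule.append((process_id ^ cyclic_id, step_hist))
        # Plain (non-cyclic) right shift: the lowest bit is discarded.
        step_hist >>= 1
    return schedule
```

With the process ID 000, a cyclic ID of 001 left by Halving, and a StepHist of binary 101, the sketch yields the peers 100, 010, and 001, which is the reverse of the Halving order, while StepHist changes as 101, 10, and 1.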

As described above, the information processing system according to the present embodiment performs the processing of Allreduce by performing data transmission and reception to and from arithmetic processing circuits different for respective threads generated by dividing a process by the dimensions of a hypercube including the arithmetic processing circuits. Thus, in performing the processing of Allreduce, it is possible to suppress occurrence of a communication path not used at a time of communication, and improve processing speed.

FIG. 20 is a diagram of assistance in explaining effects of processing of Allreduce according to the first embodiment. In FIG. 20, an axis of ordinates indicates a multiplying factor of communication time with respect to existing Allreduce processing, and an axis of abscissas indicates the number of dimensions of the hypercube.

Description will be made of a case where the band of one cable is b bytes/second. In this case, a time taken to communicate data of D bytes in one path is D/b seconds. Therefore, when the communication amount of the data is halved, the communication time is D/(2b) seconds. For example, in the case where arithmetic processing circuits have the configuration of a two-dimensional hypercube, the size of a thread is half of the whole, thus halving the data size for transmission and reception, and halving the communication time. Communication is then performed over two paths at a time, so that the band is 2b bytes/second. In addition, in the case of the configuration of an n-dimensional hypercube, the band is nb bytes/second, and the data size for transmission and reception in each path is D/n bytes. Hence, as illustrated in FIG. 20, when the processing of Allreduce according to the present application is performed, processing speed is improved as compared with the existing processing of Allreduce.
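
The relation above may be checked numerically with the following sketch. The function name and arguments are hypothetical. The sketch models only the transfer time of one communication over n paths used at the same time, and ignores latency and the time of the addition processing.

```python
def communication_time(data_bytes, band_bytes_per_second, n_dims):
    """Time to move data_bytes when it is split evenly over n_dims paths.

    Each path carries data_bytes / n_dims bytes at band_bytes_per_second,
    and all paths are assumed to be used at the same time.
    """
    per_path_bytes = data_bytes / n_dims
    return per_path_bytes / band_bytes_per_second
```

For D of 1000 bytes and b of 100 bytes/second, the sketch gives 10 seconds for a single path and 5 seconds for a two-dimensional hypercube, that is, a multiplying factor of 1/n with respect to the single-path case.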

In addition, when the processing of Allreduce according to the present embodiment is performed in a topology in which arithmetic processing circuits are coupled to each other by a network switch, a plurality of pieces of data are transmitted simultaneously on one network extending from each of the arithmetic processing circuits to the network switch. Therefore, a load concentrates on paths leading to the network switch, and a decrease in speed may occur. This is also true for a case where a multi-stage network switch is used. Therefore, the information processing system performing the processing of Allreduce according to the present embodiment may exert more effect in a case where arithmetic processing circuits have a direct connection network without the intervention of a switch.

Second Embodiment

FIGS. 21A and 21B illustrate diagrams of a hardware configuration of an information processing system according to a second embodiment. The information processing system according to the present embodiment has eight arithmetic processing circuits 20 mounted on a system board 1. The arithmetic processing circuits 20 are coupled so as to form a three-dimensional hypercube.

As in the first embodiment, the arithmetic processing circuits 20 according to the present embodiment also have the hardware illustrated in FIGS. 3A and 3B. Further, a block diagram of the network control circuit 29 according to the present embodiment is illustrated in FIGS. 4A and 4B. In the following description, description of functions of parts similar to those of the first embodiment may be omitted.

Because the arithmetic processing circuits 20 are included in the three-dimensional hypercube, the thread generating circuit 202 according to the present embodiment generates three threads Th1 to Th3 for one process, as illustrated in FIG. 22. FIG. 22 is a diagram of assistance in explaining destination determination processing according to the second embodiment.

The transmission and reception data size calculating circuit 204 determines a data size for transmission and reception as in the first embodiment. The transmission and reception data size is the size of data strings stored in blocks enclosed by thick frames in FIG. 22. For example, the data size becomes half the size in each communication.

In addition, the target data determining circuit 205 assigns transmission and reception target indexes to generated target determination groups, and determines an area as a reception target and an area as a transmission target as in the first embodiment. Then, the target data determining circuit 205 sets data corresponding to the determined areas as transmission target data and reception target data.

In addition, as illustrated in FIG. 22, the destination determining circuit 206 obtains an exclusive disjunction of a process ID and a cyclic ID for each thread, and sets the value of the exclusive disjunction as the process ID of a transmission destination in the thread. FIG. 22 illustrates the determination of transmission destination process IDs in each communication by an arithmetic processing circuit 20 assigned 000 as a process ID thereof. In FIG. 22, a cyclic ID is provided to each thread in each communication. Then, a transmission destination process ID obtained from an exclusive disjunction of the process ID and the cyclic ID is provided at a head of an arrow extending from the cyclic ID.

In this case, as illustrated in FIG. 23, the transmission destination arithmetic processing circuits 20 change in each communication. FIG. 23 is a diagram of assistance in explaining destination transitions in a three-dimensional hypercube. Respective circles in FIG. 23 represent the arithmetic processing circuits 20, and numbers provided to the respective circles represent the process IDs of the respective arithmetic processing circuits 20.

In the first communication, as indicated by a solid line arrow 411, the arithmetic processing circuit 20 having the process ID of 000 sets an arithmetic processing circuit 20 having a process ID of 001 as the data transmission destination of the thread Th1. In addition, as indicated by a broken line arrow 412, the arithmetic processing circuit 20 having the process ID of 000 sets an arithmetic processing circuit 20 having a process ID of 010 as the data transmission destination of the thread Th2. Further, as indicated by a dash-single-dot line arrow 413, the arithmetic processing circuit 20 having the process ID of 000 sets an arithmetic processing circuit 20 having a process ID of 100 as the data transmission destination of the thread Th3.

Next, in the second communication, as indicated by a solid line arrow 414, the arithmetic processing circuit 20 having the process ID of 000 sets the arithmetic processing circuit 20 having the process ID of 010 as the data transmission destination of the thread Th1. In addition, as indicated by a broken line arrow 415, the arithmetic processing circuit 20 having the process ID of 000 sets the arithmetic processing circuit 20 having the process ID of 100 as the data transmission destination of the thread Th2. Further, as indicated by a dash-single-dot line arrow 416, the arithmetic processing circuit 20 having the process ID of 000 sets the arithmetic processing circuit 20 having the process ID of 001 as the data transmission destination of the thread Th3.

Next, in the third communication, as indicated by a solid line arrow 417, the arithmetic processing circuit 20 having the process ID of 000 sets the arithmetic processing circuit 20 having the process ID of 100 as the data transmission destination of the thread Th1. In addition, as indicated by a broken line arrow 418, the arithmetic processing circuit 20 having the process ID of 000 sets the arithmetic processing circuit 20 having the process ID of 001 as the data transmission destination of the thread Th2. Further, as indicated by a dash-single-dot line arrow 419, the arithmetic processing circuit 20 having the process ID of 000 sets the arithmetic processing circuit 20 having the process ID of 010 as the data transmission destination of the thread Th3.

Thus, the arithmetic processing circuit 20 sets different arithmetic processing circuits 20 for the respective threads as transmission destinations of data in one communication. The arithmetic processing circuit 20 may therefore transmit the data to all of the adjacent arithmetic processing circuits 20 in one communication. It is thus possible to suppress occurrence of an unused path, and make full use of the band.

In summary, as illustrated in FIG. 22, the destination determining circuit 206 performs exclusive disjunction operation on the cyclic IDs as cyclic number information and the process ID as identification number information, and calculates process IDs as destination number information by the exclusive disjunction operation. Then, the destination determining circuit 206 sets the calculated process IDs as the process IDs of transmission destination arithmetic processing circuits 20 in each communication illustrated in FIG. 23.
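
The destination transitions described above may be reproduced with the following sketch. The function names are hypothetical. The initial cyclic IDs of 001, 010, and 100 for the threads Th1 to Th3 are read off from the first communication of the process ID 000 described above, and the left shift per communication follows the Halving procedure of the first embodiment.

```python
def rotate_left(value, n_bits):
    """Rotate an n_bits-wide value to the left by one bit."""
    mask = (1 << n_bits) - 1
    return ((value << 1) | (value >> (n_bits - 1))) & mask


def destination_table(process_id, n_dims):
    """Return, per communication, the destination process ID of each thread.

    Row k holds the destinations of Th1..Th(n_dims) in the (k+1)-th
    communication, written as n_dims-bit binary strings.
    """
    cyclic_ids = [1 << t for t in range(n_dims)]  # Th1 -> 001, Th2 -> 010, ...
    table = []
    for _ in range(n_dims):
        table.append([format(process_id ^ c, "0{}b".format(n_dims))
                      for c in cyclic_ids])
        cyclic_ids = [rotate_left(c, n_dims) for c in cyclic_ids]
    return table
```

For the process ID 000 in a three-dimensional hypercube, the rows are 001, 010, 100 in the first communication, 010, 100, 001 in the second communication, and 100, 001, 010 in the third communication. Each row contains three different destinations, each of which differs from the own process ID in exactly one bit.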

FIG. 24 is a diagram illustrating arithmetic processing circuits performing communication in each communication in an information processing system according to the second embodiment. In FIG. 24, the arithmetic processing circuits 20 performing communication are represented by assigning symbols #00 to #07 to the eight respective arithmetic processing circuits 20 coupled so as to form a three-dimensional hypercube. In this case, the information processing system performs Allreduce processing with a data amount of 24 m.

In FIG. 24, communications to which no pattern is set represent sets of arithmetic processing circuits 20 transmitting and receiving data of the thread Th1. In addition, communications to which a hatching pattern is set represent sets of arithmetic processing circuits 20 transmitting and receiving data of the thread Th2. Further, communications to which a dot pattern is set represent sets of arithmetic processing circuits 20 transmitting and receiving data of the thread Th3.

Because Halving is performed, a communication amount is 4 m in the first communication, the communication amount is 2 m, which is half of 4 m, in the second communication, and the communication amount is 1 m in the third communication. Thereafter, because Doubling is performed, the communication amount is 2 m in a fourth communication, and the communication amount is 4 m in a fifth communication.
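
The Halving part of these communication amounts may be reproduced with the following sketch. The function name is hypothetical. The sketch reads the data amount of 24 m as being divided evenly among the three threads, which is an inference from the 4 m of the first communication and not a statement of the embodiment, and it covers only the Halving phase.

```python
def halving_amounts(total_amount, n_dims):
    """Per-thread communication amount in each Halving communication.

    Assumptions for this sketch: total_amount is split evenly over n_dims
    threads, and the amount sent is halved in every communication.
    """
    size = total_amount // n_dims  # amount handled by one thread
    amounts = []
    for _ in range(n_dims):
        size //= 2
        amounts.append(size)
    return amounts
```

With a total of 24 (in units of m) and three dimensions, the sketch gives 4, 2, and 1 for the first to third communications.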

As illustrated in FIG. 24, the arithmetic processing circuits #00 to #07 each perform communication with adjacent arithmetic processing circuits 20 in each communication. Then, the arithmetic processing circuits #00 to #07 perform the processing of Allreduce by changing the arithmetic processing circuits 20 as communication destinations for each thread in each communication.

As described above, even when the arithmetic processing circuits are coupled so as to form a three-dimensional hypercube, it is possible to suppress occurrence of a path not used for communication, and improve processing speed.

Third Embodiment

FIGS. 25A and 25B illustrate diagrams of a hardware configuration of an information processing system according to a third embodiment. The information processing system according to the present embodiment has two arithmetic processing circuits 20 mounted on one system board 1. The arithmetic processing circuits 20 on each system board 1 are coupled so as to form a one-dimensional hypercube, and four system boards 1 are interconnected via a network switch 5.

Also in this case, the transmission and reception data size calculating circuit 204 determines the data size of transmission and reception targets in each communication. Then, the target data determining circuit 205 determines transmission target data and reception target data according to the data size. However, in the case of a one-dimensional hypercube, Halving is ended in one communication. Similarly, Doubling is ended in one communication.

FIG. 26 is a diagram illustrating arithmetic processing circuits performing communication in each communication in an information processing system according to the third embodiment. In FIG. 26, the arithmetic processing circuits 20 performing communication are represented by assigning symbols #0 to #7 to eight respective arithmetic processing circuits 20 coupled so as to form a one-dimensional hypercube on four system boards 1. In this case, the information processing system performs Allreduce processing with a data amount of 24 m.

In this case, the Allreduce processing is also performed between the system boards 1. For example, data is mutually exchanged between the four system boards 1. When the number of system boards 1 set as targets of Allreduce is thus increased, the amount of Allreduce processing performed via the network switch 5 increases.

As described above, even when a coupling is made so as to form a one-dimensional hypercube, it is possible to suppress occurrence of a path not used for communication, and improve processing speed. However, an increase in the number of system boards increases the Allreduce processing via the network switch. Thus, it is more desirable to reduce the processing of Allreduce using the network switch by increasing the number of directly coupled arithmetic processing circuits.

Here, in the foregoing embodiments, description has been made of a case where one of arithmetic processing circuits 20 included in a hypercube performs process ID assignment. However, there is no limitation to this. For example, an arithmetic processing circuit 20 for process ID assignment may be disposed.

All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.

What is claimed is:
 1. An information processing system comprising a plurality of information processing apparatuses, each of the plurality of information processing apparatuses incorporating a plurality of arithmetic processing circuits, each of the plurality of arithmetic processing circuits includes: a dividing circuit configured to divide a plurality of data blocks retained by the arithmetic processing circuit into groups of a number equal to the number of the plurality of arithmetic processing circuits included in the information processing apparatus including the own device, a data selecting circuit configured to select respective first data blocks from the plurality of data blocks included in the respective groups, a transmission destination selecting circuit configured to select arithmetic processing circuits different from each other as respective transmission destinations from the plurality of arithmetic processing circuits included in the information processing apparatus for the respective first data blocks selected by the data selecting circuit based on destination number information obtained by exclusive disjunction operation on identification number information assigned to each arithmetic processing circuit and cyclic number information assigned to each group, and a transmitting circuit configured to transmit the respective first data blocks selected by the data selecting circuit to the respective arithmetic processing circuits selected by the transmission destination selecting circuit.
 2. The information processing system according to claim 1, wherein the transmission destination selecting circuit selects, as the transmission destinations, the arithmetic processing circuits different from each other from arithmetic processing circuits adjacent to the own device among the plurality of arithmetic processing circuits included in the information processing apparatus.
 3. The information processing system according to claim 2, wherein each of the arithmetic processing circuits is assigned the identification number information represented by the number of bits, the number of bits being the number of arithmetic processing circuits adjacent to the own device, and the transmission destination selecting circuit assigns the cyclic number information represented by the number of bits to each group, performs exclusive disjunction operation on the identification number information assigned to the own device and the cyclic number information, and selects an arithmetic processing circuit assigned identification number information coinciding with destination number information obtained by the exclusive disjunction operation as the transmission destination of the group assigned the cyclic number information used in the exclusive disjunction operation.
 4. The information processing system according to claim 3, wherein the arithmetic processing circuit further includes a receiving circuit configured to receive a first data block transmitted from another arithmetic processing circuit, performs operation using a given data block corresponding to the received first data block among the plurality of data blocks and the received first data block, and set a result of the operation as data of the given data block.
 5. The information processing system according to claim 4, wherein the transmission destination selecting circuit shifts the cyclic number information by one after the transmitting of the first data blocks by the transmitting circuit and storing of the result of the operation by the receiving circuit are completed.
 6. The information processing system according to claim 5, wherein the data selecting circuit selects second data blocks from the respective groups after the transmitting of the first data blocks by the transmitting circuit and the storing of the result of the operation by the receiving circuit are completed, by performing exclusive disjunction operation on the identification number information and the cyclic number information shifted by one, the transmission destination selecting circuit selects arithmetic processing circuits different from each other for the respective second data blocks selected by the data selecting circuit and different from the transmission destinations of the first data blocks in the groups including the respective second data blocks as transmission destinations from among the plurality of arithmetic processing circuits included in the information processing apparatus, and the transmitting circuit transmits the respective second data blocks selected by the data selecting circuit to the arithmetic processing circuits selected by the transmission destination selecting circuit.
 7. The information processing system according to claim 6, wherein the transmission destination selecting circuit selects, as the transmission destinations, arithmetic processing circuits retaining third data blocks as complementing counterparts of the second data blocks.
 8. The information processing system according to claim 1, wherein the data blocks are a data group obtained by dividing data retained by the arithmetic processing circuit into a value obtained by multiplying a number resulting from raising two to a power of the number of adjacent arithmetic processing circuits by the number of the adjacent arithmetic processing circuits.
 9. The information processing system according to claim 1, wherein the arithmetic processing circuits are each assigned the identification number information represented by given bits, and arithmetic processing circuits different from each other in a value of one bit in the identification number information are coupled to each other.
 10. An arithmetic processing circuit comprising: a dividing circuit configured to divide a plurality of data blocks retained by the arithmetic processing circuit into groups of a number equal to the number of arithmetic processing circuits included in an information processing apparatus incorporating the own arithmetic processing circuit; a data selecting circuit configured to select respective first data blocks from the plurality of data blocks included in the respective groups; a transmission destination selecting circuit configured to select arithmetic processing circuits different from each other as respective transmission destinations from the plurality of arithmetic processing circuits included in the information processing apparatus for the respective first data blocks selected by the data selecting circuit based on destination number information obtained by exclusive disjunction operation on identification number information assigned to each arithmetic processing circuit and cyclic number information assigned to each group; and a transmitting circuit configured to transmit the respective first data blocks selected by the data selecting circuit to the respective arithmetic processing circuits selected by the transmission destination selecting circuit.
 11. The arithmetic processing circuit according to claim 10, wherein the transmission destination selecting circuit selects, as the transmission destinations, the arithmetic processing circuits different from each other from arithmetic processing circuits adjacent to the own device among the plurality of arithmetic processing circuits included in the information processing apparatus.
 12. The arithmetic processing circuit according to claim 11, wherein each of the arithmetic processing circuits is assigned the identification number information represented by the number of bits, the number of bits being the number of arithmetic processing circuits adjacent to the own device, and the transmission destination selecting circuit assigns the cyclic number information represented by the number of bits to each group, performs exclusive disjunction operation on the identification number information assigned to the own device and the cyclic number information, and selects an arithmetic processing circuit assigned identification number information coinciding with destination number information obtained by the exclusive disjunction operation as the transmission destination of the group assigned the cyclic number information used in the exclusive disjunction operation.
 13. The arithmetic processing circuit according to claim 12, wherein the arithmetic processing circuit further includes a receiving circuit configured to receive a first data block transmitted from another arithmetic processing circuit, performs operation using a given data block corresponding to the received first data block among the plurality of data blocks and the received first data block, and set a result of the operation as data of the given data block.
 14. The arithmetic processing circuit according to claim 13, wherein the transmission destination selecting circuit shifts the cyclic number information by one after the transmitting of the first data blocks by the transmitting circuit and storing of the result of the operation by the receiving circuit are completed.
 15. The arithmetic processing circuit according to claim 14, wherein the data selecting circuit selects second data blocks from the respective groups after the transmitting of the first data blocks by the transmitting circuit and the storing of the result of the operation by the receiving circuit are completed, by performing exclusive disjunction operation on the identification number information and the cyclic number information shifted by one, the transmission destination selecting circuit selects arithmetic processing circuits different from each other for the respective second data blocks selected by the data selecting circuit and different from the transmission destinations of the first data blocks in the groups including the respective second data blocks as transmission destinations from among the plurality of arithmetic processing circuits included in the information processing apparatus, and the transmitting circuit transmits the respective second data blocks selected by the data selecting circuit to the arithmetic processing circuits selected by the transmission destination selecting circuit.
 16. The arithmetic processing circuit according to claim 15, wherein the transmission destination selecting circuit selects, as the transmission destinations, arithmetic processing circuits retaining third data blocks as complementing counterparts of the second data blocks.
 17. The arithmetic processing circuit according to claim 10, wherein the data blocks are a data group obtained by dividing data retained by the arithmetic processing circuit into a value obtained by multiplying a number resulting from raising two to a power of the number of adjacent arithmetic processing circuits by the number of the adjacent arithmetic processing circuits.
 18. The arithmetic processing circuit according to claim 10, wherein the arithmetic processing circuits are each assigned the identification number information represented by given bits, and arithmetic processing circuits different from each other in a value of one bit in the identification number information are coupled to each other.
 19. A control method for an information processing system including a plurality of information processing apparatuses, each of the plurality of information processing apparatuses incorporating a plurality of arithmetic processing circuits, each of the plurality of arithmetic processing circuits including a dividing circuit, a data selecting circuit, a transmission destination selecting circuit, and a transmitting circuit, the method comprising: dividing, by the dividing circuit, a plurality of data blocks retained by the arithmetic processing circuit into groups of a number equal to the number of the plurality of arithmetic processing circuits included in the information processing apparatus including the own device; selecting, by the data selecting circuit, respective first data blocks from the plurality of data blocks included in the respective groups; selecting, by the transmission destination selecting circuit, arithmetic processing circuits different from each other as respective transmission destinations from the plurality of arithmetic processing circuits included in the information processing apparatus for the respective first data blocks selected by the data selecting circuit based on destination number information obtained by exclusive disjunction operation on identification number information assigned to each arithmetic processing circuit and cyclic number information assigned to each group; and transmitting, by the transmitting circuit, the respective first data blocks selected by the data selecting circuit to the respective arithmetic processing circuits selected by the transmission destination selecting circuit.